The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
Kaggle is a data science competition platform that hosts many datasets. In the past, submitting a result was troublesome because you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
kaggle.json file
kaggle.json in the right place
For more detailed information on setting up the Kaggle API, see here and here.
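As a rough sketch of the "right place" check, the snippet below looks for `kaggle.json` under `~/.kaggle` (the location used by the shell commands in this notebook) and reports whether the file is restricted to its owner, which is what `chmod 600` achieves. The helper name `kaggle_credentials_status` and its return format are assumptions made for illustration; they are not part of the Kaggle API.

```python
import stat
from pathlib import Path


def kaggle_credentials_status(config_dir=None):
    """Report whether kaggle.json exists and is accessible only by its owner.

    Hypothetical helper: defaults to ~/.kaggle, the directory used in this notebook.
    """
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token = config_dir / "kaggle.json"
    if not token.is_file():
        return {"path": str(token), "exists": False, "owner_only": False}
    mode = stat.S_IMODE(token.stat().st_mode)
    # Owner-only means no group/other permission bits are set (e.g. 0o600).
    owner_only = (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return {"path": str(token), "exists": True, "owner_only": owner_only}
```

Calling `kaggle_credentials_status()` before running the `kaggle` commands gives a quick hint about whether the token is missing or too widely readable.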
!pip install kaggle
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/site-packages (1.5.12)
Requirement already satisfied: requests in /usr/local/lib/python3.7/site-packages (from kaggle) (2.25.1)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/site-packages (from kaggle) (2021.5.30)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/site-packages (from kaggle) (2.8.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/site-packages (from kaggle) (4.62.1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/site-packages (from kaggle) (1.15.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/site-packages (from kaggle) (1.26.6)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/site-packages (from kaggle) (5.0.2)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests->kaggle) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests->kaggle) (4.0.0)
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
!pwd
/N/home/u100/vshriram/Carbonate/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk
!mkdir ~/.kaggle
#!cp /N/u/vshriram/Carbonate/Downloads/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
mkdir: cannot create directory '/N/u/vshriram/Carbonate/.kaggle': File exists
! kaggle competitions files home-credit-default-risk
name                                 size  creationDate
----------------------------------  -----  -------------------
credit_card_balance.csv             405MB  2019-12-11 02:55:35
bureau_balance.csv                  358MB  2019-12-11 02:55:35
application_test.csv                 25MB  2019-12-11 02:55:35
POS_CASH_balance.csv                375MB  2019-12-11 02:55:35
HomeCredit_columns_description.csv   37KB  2019-12-11 02:55:35
bureau.csv                          162MB  2019-12-11 02:55:35
application_train.csv               158MB  2019-12-11 02:55:35
previous_application.csv            386MB  2019-12-11 02:55:35
sample_submission.csv               524KB  2019-12-11 02:55:35
installments_payments.csv           690MB  2019-12-11 02:55:35
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
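There are 7 different sources of data: the main application table (application_train.csv and application_test.csv), bureau.csv, bureau_balance.csv, previous_application.csv, POS_CASH_balance.csv, credit_card_balance.csv, and installments_payments.csv.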
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Click the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
DATA_DIR = "/N/u/vshriram/Carbonate/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk" #same level as course repo in the data directory
#DATA_DIR = "/root/shared/I526_AML_Student/Data/home-credit-default-risk"
#DATA_DIR = os.path.join('./ddddd/')
!mkdir $DATA_DIR
mkdir: cannot create directory '/N/u/vshriram/Carbonate/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk': File exists
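The `mkdir` call above reports "File exists" whenever the directory is already there. A minimal sketch of an idempotent alternative in Python follows; `ensure_dir` is a hypothetical helper name, and the shell equivalent would be `mkdir -p`.

```python
import os


def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

With this helper, re-running the cell (for example `ensure_dir(DATA_DIR)`) would not produce an error on the second run.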
!ls -l $DATA_DIR
total 7161760 -rw-rw-r-- 1 vshriram vshriram 11889867 Dec 6 17:22 Group21_Phase1(1)(1).ipynb -rw-rw-r-- 1 vshriram vshriram 11283226 Dec 12 18:19 Group21_Phase2 (2).ipynb -rw-r--r-- 1 vshriram vshriram 20907849 Dec 12 00:34 Group21_Phase2.ipynb -rw-rw-r-- 1 vshriram vshriram 5428396 Dec 7 17:07 Group23_Phase2_AML.ipynb drwxrwxr-x 3 vshriram vshriram 32768 Nov 29 20:47 HCDR_Phase_1_baseline_submission -rw-rw-r-- 1 vshriram vshriram 875737 Dec 7 14:06 HW10_End_to_end_Machine_Learning_Project (1).html -rw-rw-r-- 1 vshriram vshriram 37383 Dec 11 2019 HomeCredit_columns_description.csv -rw-rw-r-- 1 vshriram vshriram 392703158 Dec 11 2019 POS_CASH_balance.csv -rw-r--r-- 1 vshriram vshriram 21462 Dec 6 21:37 Untitled.ipynb -rw-rw-r-- 1 vshriram vshriram 26567651 Dec 11 2019 application_test.csv -rw-rw-r-- 1 vshriram vshriram 166133370 Dec 11 2019 application_train.csv -rw-rw-r-- 1 vshriram vshriram 170016717 Dec 11 2019 bureau.csv -rw-rw-r-- 1 vshriram vshriram 375592889 Dec 11 2019 bureau_balance.csv -rw-rw-r-- 1 vshriram vshriram 424582605 Dec 11 2019 credit_card_balance.csv -rw-r--r-- 1 vshriram vshriram 721616255 Dec 4 11:40 home-credit-default-risk.zip -rw-rw-r-- 1 vshriram vshriram 723118349 Dec 11 2019 installments_payments.csv -rw-rw-r-- 1 vshriram vshriram 404973293 Dec 11 2019 previous_application.csv -rw-rw-r-- 1 vshriram vshriram 536202 Dec 11 2019 sample_submission.csv -rw-r--r-- 1 vshriram vshriram 1325792 Dec 7 20:46 submission.csv -rw-r--r-- 1 vshriram vshriram 1278347733 Dec 6 21:28 train.pkl -rw-r--r-- 1 vshriram vshriram 1297567752 Dec 7 19:40 train2.pkl -rw-r--r-- 1 vshriram vshriram 1298414174 Dec 7 20:02 train3.pkl
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
Downloading home-credit-default-risk.zip to /N/u/vshriram/Carbonate/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk 99%|████████████████████████████████████████▋| 682M/688M [00:06<00:00, 140MB/s] 100%|█████████████████████████████████████████| 688M/688M [00:06<00:00, 115MB/s]
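The download leaves a single archive in the data directory, which still has to be unzipped before the CSV files can be read. One possible way to do this from Python is sketched below using the standard `zipfile` module; `extract_archive` is a hypothetical helper, and the archive name in the usage note is taken from the download output above.

```python
import zipfile


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

In this notebook the call would look like `extract_archive(os.path.join(DATA_DIR, "home-credit-default-risk.zip"), DATA_DIR)`, assuming the archive was downloaded into `DATA_DIR`.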
pip install missingno
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: missingno in /N/home/u100/vshriram/Carbonate/.local/lib/python3.7/site-packages (0.5.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/site-packages (from missingno) (1.6.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/site-packages (from missingno) (1.19.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/site-packages (from missingno) (3.4.2)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/site-packages (from missingno) (0.11.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.7/site-packages (from matplotlib->missingno) (8.3.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/site-packages (from matplotlib->missingno) (1.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.7/site-packages (from matplotlib->missingno) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/site-packages (from matplotlib->missingno) (0.10.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.7/site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->missingno) (1.15.0)
Requirement already satisfied: pandas>=0.23 in /usr/local/lib/python3.7/site-packages (from seaborn->missingno) (1.3.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas>=0.23->seaborn->missingno) (2021.1)
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
pip install tensorboard
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: tensorboard in /usr/local/lib/python3.7/site-packages (2.6.0)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.7/site-packages (from tensorboard) (2.0.1)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.7/site-packages (from tensorboard) (0.6.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.7/site-packages (from tensorboard) (0.4.5)
Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/lib/python3.7/site-packages (from tensorboard) (1.35.0)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.7/site-packages (from tensorboard) (52.0.0.post20210125)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.7/site-packages (from tensorboard) (1.8.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.7/site-packages (from tensorboard) (3.3.4)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.7/site-packages (from tensorboard) (0.13.0)
Requirement already satisfied: grpcio>=1.24.3 in /usr/local/lib/python3.7/site-packages (from tensorboard) (1.39.0)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.7/site-packages (from tensorboard) (0.37.0)
Requirement already satisfied: protobuf>=3.6.0 in /usr/local/lib/python3.7/site-packages (from tensorboard) (3.17.3)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.7/site-packages (from tensorboard) (2.25.1)
Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.7/site-packages (from tensorboard) (1.19.5)
Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from absl-py>=0.4->tensorboard) (1.15.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard) (4.7.2)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard) (4.2.2)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard) (0.2.8)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard) (1.3.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/site-packages (from markdown>=2.6.8->tensorboard) (3.10.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard) (0.4.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard) (2021.5.30)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard) (1.26.6)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard) (4.0.0)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard) (3.1.1)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/site-packages (from importlib-metadata->markdown>=2.6.8->tensorboard) (3.7.4.3)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/site-packages (from importlib-metadata->markdown>=2.6.8->tensorboard) (3.5.0)
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import missingno as msno
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
unzippingReq = False
if unzippingReq:  # please modify this code
    # Extract each per-table archive into the 'datasets' folder.
    zip_names = ('application_train', 'application_test', 'bureau_balance', 'bureau',
                 'credit_card_balance', 'installments_payments', 'POS_CASH_balance',
                 'previous_application')
    for zip_name in zip_names:
        # The context manager closes the archive even if extraction fails.
        with zipfile.ZipFile(f'{zip_name}.csv.zip', 'r') as zip_ref:
            zip_ref.extractall('datasets')
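def load_data(in_path, name):
    # Read one CSV file, print its shape and schema, and preview the first rows.
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())  # df.info() prints the summary and returns None, hence the "None" in the output
    display(df.head(5))  # display() is provided by the notebook (IPython)
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily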
ds_name = 'application_train'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
(307511, 122)
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: gender, income, family status, education, and so on.
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance",
            "installments_payments", "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB None
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB None
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB None
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB None
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB None
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype
---  ------                 -----
 0   SK_ID_PREV             int64
 1   SK_ID_CURR             int64
 2   MONTHS_BALANCE         int64
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object
 6   SK_DPD                 int64
 7   SK_DPD_DEF             int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: user 36.8 s, sys: 4.92 s, total: 41.7 s
Wall time: 41.7 s
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train       : [    307,511, 122]
dataset application_test        : [     48,744, 121]
dataset bureau                  : [  1,716,428, 17]
dataset bureau_balance          : [ 27,299,925, 3]
dataset credit_card_balance     : [  3,840,312, 23]
dataset installments_payments   : [ 13,605,401, 8]
dataset previous_application    : [  1,670,214, 37]
dataset POS_CASH_balance        : [ 10,001,358, 8]
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
datasets["application_train"].describe() #numerical only features
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
datasets["application_test"].describe() #numerical only features
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
datasets["application_train"].describe(include='all') #look at all categorical and numerical
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
11 rows × 122 columns
datasets["application_train"].size
37516342
datasets["application_train"].shape
(307511, 122)
X = datasets["application_train"].drop(['TARGET'], axis = 1)
y = datasets["application_train"]["TARGET"]
# Split the provided training data into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
X train shape: (196806, 121)
X validation shape: (49202, 121)
X test shape: (61503, 121)
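The two chained `train_test_split` calls leave roughly 64% of the labelled rows for training, 16% for validation, and 20% for testing. The sketch below reproduces the row counts with plain arithmetic. It assumes scikit-learn rounds the held-out portion up to the next whole row; the helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_rows, test_frac=0.20, valid_frac=0.20):
    """Return (train, valid, test) row counts for two chained hold-out splits."""
    n_test = math.ceil(n_rows * test_frac)      # first split: hold out the test set
    n_rest = n_rows - n_test                    # rows left for train + validation
    n_valid = math.ceil(n_rest * valid_frac)    # second split: hold out validation
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(307511)
print(n_train, n_valid, n_test)
print(round(n_train / 307511, 2), round(n_valid / 307511, 2), round(n_test / 307511, 2))
```

If the assumption holds, the three counts should agree with the shapes printed by the notebook. Because only a small share of loans default, passing `stratify=y` to `train_test_split` is one option for keeping the class ratio similar across the three sets.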
Summary statistics:
X_train.info()
X_train.describe()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 196806 entries, 9717 to 255
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 183.2+ MB
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 196806.000000 | 196806.000000 | 1.968060e+05 | 1.968060e+05 | 196798.000000 | 1.966250e+05 | 196806.000000 | 196806.000000 | 196806.000000 | 196806.000000 | ... | 196806.000000 | 196806.000000 | 196806.000000 | 196806.000000 | 170305.000000 | 170305.000000 | 170305.000000 | 170305.000000 | 170305.000000 | 170305.000000 |
| mean | 278195.549368 | 0.416786 | 1.683316e+05 | 5.993323e+05 | 27109.915949 | 5.387623e+05 | 0.020848 | -16048.994370 | 63951.217737 | -4989.446059 | ... | 0.008125 | 0.000569 | 0.000478 | 0.000295 | 0.006588 | 0.006852 | 0.034062 | 0.265758 | 0.267391 | 1.899797 |
| std | 102732.472419 | 0.719989 | 1.055828e+05 | 4.029388e+05 | 14475.618426 | 3.699055e+05 | 0.013813 | 4361.083932 | 141391.441992 | 3523.627220 | ... | 0.089771 | 0.023849 | 0.021850 | 0.017165 | 0.085075 | 0.107634 | 0.203568 | 0.907199 | 0.880501 | 1.869260 |
| min | 100003.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1993.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | -23416.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189184.250000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19687.000000 | -2759.000000 | -7486.750000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278235.500000 | 0.000000 | 1.440000e+05 | 5.147775e+05 | 24907.500000 | 4.500000e+05 | 0.018850 | -15773.000000 | -1211.000000 | -4507.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367111.750000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34609.500000 | 6.795000e+05 | 0.028663 | -12432.000000 | -288.000000 | -2008.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456254.000000 | 19.000000 | 1.350000e+07 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 105 columns
# Concatenating X_train and y_train
Xy_train = pd.concat([X_train, y_train], axis=1)
Xy_train.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9717 | 111307 | Cash loans | F | N | Y | 0 | 112500.0 | 1078200.0 | 29650.5 | 900000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 203356 | 335752 | Cash loans | F | Y | Y | 1 | 247500.0 | 1125000.0 | 44748.0 | 1125000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0 |
| 81757 | 194805 | Cash loans | F | Y | Y | 1 | 180000.0 | 417024.0 | 22621.5 | 360000.0 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 84860 | 198457 | Cash loans | F | N | Y | 0 | 247500.0 | 1078200.0 | 34780.5 | 900000.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 234668 | 371838 | Cash loans | F | N | Y | 1 | 135000.0 | 824823.0 | 24246.0 | 688500.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 |
5 rows × 122 columns
numerical_features = X_train.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include = ['object', 'bool']).columns
print(f"\nNumerical features : {list(numerical_features)}")
print(f"\nCategorical features : {list(categorical_features)}")
Numerical features : ['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']

Categorical features : ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
msno.bar(Xy_train)
plt.show()
Xy_train.isnull().sum()
SK_ID_CURR 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
FLAG_OWN_REALTY 0
...
AMT_REQ_CREDIT_BUREAU_WEEK 26501
AMT_REQ_CREDIT_BUREAU_MON 26501
AMT_REQ_CREDIT_BUREAU_QRT 26501
AMT_REQ_CREDIT_BUREAU_YEAR 26501
TARGET 0
Length: 122, dtype: int64
Many columns in the dataset have a large share of missing values; the columns at the top of the table above are missing in roughly 60-70% of rows.
missing_application_train_data
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| ... | ... | ... |
| NAME_HOUSING_TYPE | 0.00 | 0 |
| NAME_FAMILY_STATUS | 0.00 | 0 |
| NAME_EDUCATION_TYPE | 0.00 | 0 |
| NAME_INCOME_TYPE | 0.00 | 0 |
| SK_ID_CURR | 0.00 | 0 |
122 rows × 2 columns
# Keep only the columns that actually have missing values
percent_data_df = missing_application_train_data[['Percent']]
percent_data_df = percent_data_df[percent_data_df['Percent'] != 0.0]
percent_data_df
| Percent | |
|---|---|
| COMMONAREA_MEDI | 69.87 |
| COMMONAREA_AVG | 69.87 |
| COMMONAREA_MODE | 69.87 |
| NONLIVINGAPARTMENTS_MODE | 69.43 |
| NONLIVINGAPARTMENTS_AVG | 69.43 |
| ... | ... |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.33 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0.33 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0.33 |
| EXT_SOURCE_2 | 0.21 |
| AMT_GOODS_PRICE | 0.09 |
64 rows × 1 columns
plt.figure(figsize = (15,30))
plt.barh(y = percent_data_df.index, width = percent_data_df['Percent'])
plt.show()
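One common follow-up to a missing-value chart like this is to shortlist the columns whose missing share exceeds some cutoff, either to drop them or to treat them separately. The sketch below shows one way to do that on a toy frame. The function name `columns_above_missing_threshold`, the 60% cutoff, and the toy columns are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold_pct=60.0):
    """Return the percent of missing values for columns above threshold_pct."""
    pct_missing = df.isna().mean() * 100   # share of NaN per column, in percent
    return pct_missing[pct_missing > threshold_pct].sort_values(ascending=False)

# Toy data: one column is 75% missing, one 50% missing, one complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, 2.0, np.nan, 4.0],
    "complete": [1, 2, 3, 4],
})
sparse_cols = columns_above_missing_threshold(toy, threshold_pct=60.0)
print(sparse_cols)
```

On the real data the same helper could be called on `datasets["application_train"]`; whether dropping such columns helps the model is something to check on the validation set rather than assume.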
Next, we build separate preprocessing pipelines for the numerical and categorical features and combine them with a ColumnTransformer.
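# Note: StandardScaler ignores NaNs when fitting and leaves them in place,
# so the median imputer that follows fills them on the scaled values.
num_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy = 'median'))
])
# Note: scikit-learn >= 1.2 renames OneHotEncoder's `sparse` argument to `sparse_output`.
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
# Apply each pipeline to its own column group; any other columns are dropped.
data_pipeline = ColumnTransformer([
    ("num_pipeline", num_pipeline, numerical_features),
    ("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)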
X_train_transformed = data_pipeline.fit_transform(X_train)
# Note: get_feature_names is deprecated since scikit-learn 1.0 in favor of get_feature_names_out.
column_names = list(numerical_features) + \
    list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
display(pd.DataFrame(X_train_transformed, columns=column_names).head())
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.624501 | -0.578880 | -0.528796 | 1.188441 | 0.175508 | 0.976570 | -1.001111 | -1.110049 | -0.457711 | 0.275696 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.560257 | 0.810033 | 0.749824 | 1.304588 | 1.218472 | 1.584835 | 0.401388 | 0.575316 | -0.471672 | -0.681559 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -0.811727 | 0.810033 | 0.110514 | -0.452448 | -0.310068 | -0.483266 | 1.836900 | 0.340053 | -0.455801 | 1.250546 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.776179 | -0.578880 | 0.749824 | 1.188441 | 0.529898 | 0.976570 | 3.740104 | 0.272179 | -0.464324 | -1.133367 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.911520 | 0.810033 | -0.315693 | 0.559617 | -0.197845 | 0.404801 | -0.928785 | -0.585637 | -0.522481 | 0.798454 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 rows × 245 columns
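The numerical pipeline scales before it imputes. That works because `StandardScaler` skips NaNs when computing the mean and standard deviation, but the result is not identical to imputing first and scaling afterwards. The NumPy sketch below imitates both orders on a toy column to make the difference visible. It is an approximation of the scikit-learn behaviour under that assumption, not the library code itself.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, np.nan])  # toy column with one missing value

# Order used in num_pipeline: scale first (statistics ignore NaN), then impute the median.
scaled = (x - np.nanmean(x)) / np.nanstd(x)
scale_then_impute = np.where(np.isnan(scaled), np.nanmedian(scaled), scaled)

# Alternative order: impute the median first, then scale the completed column.
filled = np.where(np.isnan(x), np.nanmedian(x), x)
impute_then_scale = (filled - filled.mean()) / filled.std()

print(scale_then_impute.round(3))
print(impute_then_scale.round(3))
```

With scale-then-impute, the observed values keep zero mean and unit variance and the filled-in entries sit at the scaled median. With impute-then-scale, the filled-in entries influence the mean and variance themselves. Either order removes the NaNs; which one is preferable depends on how much missing data a column has.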
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=column_names)
X_train_transformed_df
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.624501 | -0.578880 | -0.528796 | 1.188441 | 0.175508 | 0.976570 | -1.001111 | -1.110049 | -0.457711 | 0.275696 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.560257 | 0.810033 | 0.749824 | 1.304588 | 1.218472 | 1.584835 | 0.401388 | 0.575316 | -0.471672 | -0.681559 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -0.811727 | 0.810033 | 0.110514 | -0.452448 | -0.310068 | -0.483266 | 1.836900 | 0.340053 | -0.455801 | 1.250546 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.776179 | -0.578880 | 0.749824 | 1.188441 | 0.529898 | 0.976570 | 3.740104 | 0.272179 | -0.464324 | -1.133367 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.911520 | 0.810033 | -0.315693 | 0.559617 | -0.197845 | 0.404801 | -0.928785 | -0.585637 | -0.522481 | 0.798454 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 196801 | 0.105580 | 2.198946 | 0.962928 | 1.748359 | 0.769578 | 1.621331 | 0.312483 | 0.515927 | -0.483625 | 0.013181 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 196802 | -0.647728 | -0.578880 | 0.195756 | 2.349464 | 1.072985 | 2.193100 | 1.836900 | -0.513636 | -0.463595 | 0.891540 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 196803 | 0.569894 | 0.810033 | 0.110514 | -0.705648 | -0.784764 | -0.604919 | 1.081931 | -1.157515 | -0.453418 | 0.312022 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 196804 | 0.743462 | 2.198946 | -0.315693 | -0.640427 | -1.242051 | -0.726572 | 0.141333 | 0.624157 | -0.458156 | 1.358106 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 196805 | -1.731692 | 0.810033 | 0.536721 | 1.042029 | 0.270945 | 0.635942 | 3.740104 | 1.076110 | -0.456558 | 1.320928 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
196806 rows × 245 columns
X_train_transformed_df.isnull().sum()
SK_ID_CURR 0
CNT_CHILDREN 0
AMT_INCOME_TOTAL 0
AMT_CREDIT 0
AMT_ANNUITY 0
..
WALLSMATERIAL_MODE_Panel 0
WALLSMATERIAL_MODE_Stone, brick 0
WALLSMATERIAL_MODE_Wooden 0
EMERGENCYSTATE_MODE_No 0
EMERGENCYSTATE_MODE_Yes 0
Length: 245, dtype: int64
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations:
FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64
correlations = pd.DataFrame(correlations, columns = ['TARGET'])
correlations
| TARGET | |
|---|---|
| EXT_SOURCE_3 | -0.178919 |
| EXT_SOURCE_2 | -0.160472 |
| EXT_SOURCE_1 | -0.155317 |
| DAYS_EMPLOYED | -0.044932 |
| FLOORSMAX_AVG | -0.044003 |
| ... | ... |
| DAYS_LAST_PHONE_CHANGE | 0.055218 |
| REGION_RATING_CLIENT | 0.058899 |
| REGION_RATING_CLIENT_W_CITY | 0.060893 |
| DAYS_BIRTH | 0.078239 |
| TARGET | 1.000000 |
106 rows × 1 columns
# Rank attributes by absolute correlation with TARGET to find the top 4
correlations["abs_Target"] = np.abs(correlations["TARGET"])
display(correlations)
correlations.sort_values("abs_Target", ascending = False, inplace = True)
display(correlations)
correlations
| TARGET | abs_Target | |
|---|---|---|
| EXT_SOURCE_3 | -0.178919 | 0.178919 |
| EXT_SOURCE_2 | -0.160472 | 0.160472 |
| EXT_SOURCE_1 | -0.155317 | 0.155317 |
| DAYS_EMPLOYED | -0.044932 | 0.044932 |
| FLOORSMAX_AVG | -0.044003 | 0.044003 |
| ... | ... | ... |
| DAYS_LAST_PHONE_CHANGE | 0.055218 | 0.055218 |
| REGION_RATING_CLIENT | 0.058899 | 0.058899 |
| REGION_RATING_CLIENT_W_CITY | 0.060893 | 0.060893 |
| DAYS_BIRTH | 0.078239 | 0.078239 |
| TARGET | 1.000000 | 1.000000 |
106 rows × 2 columns
| TARGET | abs_Target | |
|---|---|---|
| TARGET | 1.000000 | 1.000000 |
| EXT_SOURCE_3 | -0.178919 | 0.178919 |
| EXT_SOURCE_2 | -0.160472 | 0.160472 |
| EXT_SOURCE_1 | -0.155317 | 0.155317 |
| DAYS_BIRTH | 0.078239 | 0.078239 |
| ... | ... | ... |
| FLAG_DOCUMENT_12 | -0.000756 | 0.000756 |
| FLAG_MOBIL | 0.000534 | 0.000534 |
| FLAG_CONT_MOBILE | 0.000370 | 0.000370 |
| FLAG_DOCUMENT_5 | -0.000316 | 0.000316 |
| FLAG_DOCUMENT_20 | 0.000215 | 0.000215 |
106 rows × 2 columns
| TARGET | abs_Target | |
|---|---|---|
| TARGET | 1.000000 | 1.000000 |
| EXT_SOURCE_3 | -0.178919 | 0.178919 |
| EXT_SOURCE_2 | -0.160472 | 0.160472 |
| EXT_SOURCE_1 | -0.155317 | 0.155317 |
| DAYS_BIRTH | 0.078239 | 0.078239 |
| ... | ... | ... |
| FLAG_DOCUMENT_12 | -0.000756 | 0.000756 |
| FLAG_MOBIL | 0.000534 | 0.000534 |
| FLAG_CONT_MOBILE | 0.000370 | 0.000370 |
| FLAG_DOCUMENT_5 | -0.000316 | 0.000316 |
| FLAG_DOCUMENT_20 | 0.000215 | 0.000215 |
106 rows × 2 columns
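The ranking above can be written more compactly. The sketch below uses a small made-up frame (the column names `score`, `age_days`, `noise`, and `label` are hypothetical) to show one way of ranking features by absolute Pearson correlation with a binary target. It selects numeric columns explicitly, since recent pandas versions raise an error when `corr()` meets object columns, and it drops the target's trivial correlation with itself.

```python
import pandas as pd

# Toy data: "score" falls and "age_days" rises when TARGET is 1; "label" is non-numeric.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 0, 1, 0],
    "score": [0.9, 0.8, 0.7, 0.2, 0.3, 0.6, 0.1, 0.75],
    "age_days": [-20000, -18000, -15000, -9000, -10000, -17000, -8000, -16000],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
    "label": list("abcdabcd"),
})

numeric = toy.select_dtypes(include="number")             # keep numeric columns only
corr_with_target = numeric.corr()["TARGET"].drop("TARGET")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)                  # signed values, ordered by magnitude

print(ranked.head(2))
```

Keeping the signed values while ordering by magnitude preserves the direction of each relationship, which matters when reading features such as `DAYS_BIRTH` that are stored as negative day counts.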
attributes = ["EXT_SOURCE_3", "EXT_SOURCE_2", "EXT_SOURCE_1","DAYS_BIRTH"]
sns.pairplot(data = datasets["application_train"], hue="TARGET", vars = attributes, height=3)
plt.show()
The pair plot shows the four attributes most strongly correlated with the TARGET column. EXT_SOURCE_1 appears roughly normally distributed, while the others are skewed but can be approximated by a normal distribution.
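A visual impression of skew can be backed up with a number. The sketch below, on two made-up series, uses pandas' sample skewness: values near 0 suggest a symmetric distribution, and positive values indicate a longer right tail. The series names and values are purely illustrative.

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])      # evenly spread around its mean
right_skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])  # one large value stretches the right tail

print(symmetric.skew(), right_skewed.skew())
```

On the real data, the equivalent check would be `datasets["application_train"][attributes].skew()`, which skips missing values by default.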
correlations.isnull().count()
TARGET        106
abs_Target    106
dtype: int64
corr_target = correlations.drop(['abs_Target'], axis = 1)
corr_target= corr_target[:20].dropna()
sns.heatmap(corr_target, annot=True, fmt='.2f')
plt.show()
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call only creates an empty extra figure
plot = corr_target[1:].plot(kind = 'bar', color = 'grey', figsize=(15,7))
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()
datasets["application_train"]['TARGET'].astype(int).hist()
plt.show()
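TARGET: 0 = loan was repaid, 1 = loan was not repaid. The classes are heavily imbalanced: the mean of TARGET is about 0.081, so only around 8% of the training loans were not repaid.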
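The imbalance has practical consequences for evaluation and training. The short calculation below starts from the mean of TARGET reported by `describe()` (0.080729). It shows the accuracy of a model that always predicts "repaid", and the class weights produced by the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. This is a back-of-the-envelope sketch; the variable names are illustrative.

```python
positive_rate = 0.080729   # mean of TARGET from describe(): share of loans not repaid
n_classes = 2

# Accuracy of a model that always predicts the majority class (0 = repaid).
baseline_accuracy = 1.0 - positive_rate

# "Balanced" weights, n_samples / (n_classes * class_count), written via class shares.
weight_repaid = 1.0 / (n_classes * (1.0 - positive_rate))
weight_default = 1.0 / (n_classes * positive_rate)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"class weights: 0 -> {weight_repaid:.2f}, 1 -> {weight_default:.2f}")
```

Since a trivial model already reaches about 92% accuracy, a ranking metric such as ROC AUC is a more informative yardstick for this problem than plain accuracy.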
cat_vars = list(categorical_features)[:4]
plt.figure(figsize=(15,4))
for idx, cat in enumerate(cat_vars):
    plt.subplot(1, len(cat_vars), idx+1)
    # Use keyword arguments; newer seaborn versions reject positional data arguments
    sns.countplot(x=cat, hue='TARGET', data=Xy_train)
plt.show()
# One explanation per categorical feature in categorical_features[4:14]
explanations = [
    'People who were not accompanied by anyone when applying appear to repay loans more easily.',
    'Working-class people usually take out more loans than people with other income types.',
    'People with Secondary / secondary special education take out more loans than people with other educational backgrounds.',
    'Married people have taken and repaid more loans than people with any other family status.',
    'People who live alone and in apartments take out more loans than other people.',
    'Laborers have repaid more loans than people in other occupations.',
    'More repaid loans were applied for on Tuesday than on any other day of the week.',
    'People working in Business organizations have repaid more loans than those in other organization types.',
    '',
    'People who live in flats have repaid more loans than people who live in other types of houses.',
]
for cat, explanation in zip(categorical_features[4:14], explanations):
    plt.figure(figsize=(15,4))
    plot = sns.countplot(x=cat, data=Xy_train, hue='TARGET')
    plt.setp(plot.get_xticklabels(), rotation=90)
    plt.title(f'Categorical distribution of {cat} with respect to TARGET')
    plt.show()
    print(explanation)
People who were not accompanied by anyone when applying appear to repay loans more easily.
Working-class people usually take out more loans than people with other income types.
People with Secondary / secondary special education take out more loans than people with other educational backgrounds.
Married people have taken and repaid more loans than people with any other family status.
People who live alone and in apartments take out more loans than other people.
Laborers have repaid more loans than people in other occupations.
More repaid loans were applied for on Tuesday than on any other day of the week.
People working in Business organizations have repaid more loans than those in other organization types.
People who live in flats have repaid more loans than people who live in other types of houses.
for cat in list(categorical_features[:14]):
    plt.figure(figsize=(15,4))
    plot = sns.countplot(x=cat, data=Xy_train)
    plt.setp(plot.get_xticklabels(), rotation=90)
    plt.title(f'Categorical distribution of {cat}')
    plt.show()
for num in list(numerical_features[3:8]):
    plt.figure(figsize=(7, 4))
    sns.boxplot(x = 'TARGET', y = num, data = Xy_train)
    plt.title(f'Numerical boxplot distribution of {num}')
    plt.show()
The box plots show the median and quartiles of each column, split by TARGET. They also reveal many outliers, drawn as individual points beyond the whiskers.
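The outliers visible in the box plots can also be counted numerically. The following is a minimal sketch, assuming the common 1.5 × IQR whisker rule (the default in matplotlib and seaborn box plots); the helper name `iqr_outlier_mask` and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])  # quartiles, ignoring NaNs
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample: seven similar values and one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 10, 95])
mask = iqr_outlier_mask(sample)
print("outliers:", sample[mask])
```

Applied to a column such as `Xy_train['AMT_CREDIT']`, `mask.sum()` would give the number of points drawn beyond the whiskers.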
for num in list(numerical_features[1:7]):
    plt.figure(figsize=(7, 4))
    plt.hist(Xy_train[num], edgecolor = 'k', bins = 25)
    plt.xlabel(num)
    plt.ylabel('count')
    plt.title(f'Numerical histogram distribution of {num}')
    plt.show()
A histogram shows how the values of a column are distributed over its range. The plots above show this distribution for each of the selected numerical columns.
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
plt.show()
The histogram suggests that most loan applications come from clients aged roughly 30 to 50.
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"])
plt.title('Applicants Occupation')
plt.xticks(rotation=90)
plt.show()
Laborers form the largest occupation group among the applicants.
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(48744, 121)
datasets["application_train"].shape
(307511, 122)
numerical_features = X_train.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include = ['object', 'bool']).columns
print(f"\nNumerical features : {list(numerical_features)}")
print(f"\nCategorical features : {list(categorical_features)}")
Numerical features : ['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'] Categorical features : ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
len(list(numerical_features))
105
len(list(categorical_features))
16
total_input_features = len(list(numerical_features)) + len(list(categorical_features))
total_input_features
121
The objective function for learning a binomial logistic regression model (log loss) can be stated as follows:
$$ \underset{\mathbf{\theta}}{\operatorname{argmin}}\left[\text{CXE}\right] = \underset{\mathbf{\theta}}{\operatorname{argmin}} \left[ -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} \right] $$
The corresponding gradient function of partial derivatives is as follows (after a little bit of math), where $\hat{\mathbf{p}}$ is the vector of predicted probabilities $\hat{p}^{(i)}$:
$$ \begin{aligned} \nabla_\text{CXE}(\mathbf{\theta}) &= \begin{pmatrix} \frac{\partial}{\partial \theta_0} \text{CXE}(\mathbf{\theta}) \\ \frac{\partial}{\partial \theta_1} \text{CXE}(\mathbf{\theta}) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \text{CXE}(\mathbf{\theta}) \end{pmatrix}\\ &= \dfrac{1}{m} \mathbf{X}^T \cdot (\hat{\mathbf{p}} - \mathbf{y}) \end{aligned} $$
For completeness, learning a binomial logistic regression model via gradient descent would use the following step iteratively:
$$ \mathbf{\theta}^{(\text{next step})} = \mathbf{\theta} - \eta \nabla_\text{CXE}(\mathbf{\theta}) $$
# StandardScaler ignores NaNs when fitting and keeps them when transforming,
# so scaling before imputation works here
num_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy = 'median'))
])
# Note: scikit-learn 1.2+ renames OneHotEncoder's `sparse` argument to `sparse_output`
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
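The log-loss objective and gradient-descent update stated above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data, the learning rate `eta`, the step count, and the function names are hypothetical, and the notebook itself relies on scikit-learn's `LogisticRegression`, which uses a different solver.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y, eps=1e-12):
    """Binary cross-entropy (CXE), averaged over the m rows of X."""
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    """Gradient of CXE: (1/m) * X^T (p_hat - y)."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def fit_logreg(X, y, eta=0.1, n_steps=500):
    """Batch gradient descent: theta <- theta - eta * gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta = theta - eta * gradient(theta, X, y)
    return theta

# Hypothetical toy data: a bias column plus one informative feature
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

theta0 = np.zeros(2)
theta_hat = fit_logreg(X_toy, y_toy)
print("loss at start:", log_loss(theta0, X_toy, y_toy))
print("loss after training:", log_loss(theta_hat, X_toy, y_toy))
```

With all parameters at zero every prediction is 0.5, so the starting loss equals log 2; the loss after training should be lower.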
results = pd.DataFrame(columns = ["Pipeline", "Dataset", "TrainAcc", "ValidAcc", "TestAcc","TrainROC","TestROC","ValidROC"])
def model_logreg(X_train):
    data_pipeline = ColumnTransformer([
        ("num_pipeline", num_pipeline, numerical_features),
        ("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)
    X_train_transformed = data_pipeline.fit_transform(X_train)
    column_names = list(numerical_features) + \
        list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
    X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=column_names)
    display(X_train_transformed_df.head())
    clf_pipe = make_pipeline(data_pipeline, LogisticRegression())
    clf_pipe.fit(X_train, y_train)
    train_acc = clf_pipe.score(X_train, y_train)
    validAcc = clf_pipe.score(X_valid, y_valid)
    testAcc = clf_pipe.score(X_test, y_test)
    ## Plotting AUC / ROC
    ns_probs = [0 for _ in range(len(y_test))]
    lr_probs = clf_pipe.predict_proba(X_test)
    # keep probabilities for the positive outcome only
    lr_probs = lr_probs[:, 1]
    # calculate scores
    ns_auc = roc_auc_score(y_test, ns_probs)
    lr_auc = roc_auc_score(y_test, lr_probs)
    print('No Skill: ROC AUC=%.3f' % (ns_auc))
    print('Logistic: ROC AUC=%.3f' % (lr_auc))
    # calculate roc curves
    ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
    lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
    # plot the roc curve for the model
    plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
    plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
    # axis labels
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    # show the legend
    plt.legend()
    # show the plot
    plt.show()
    train_roc = roc_auc_score(y_train, clf_pipe.predict_proba(X_train)[:, 1])
    test_roc = roc_auc_score(y_test, clf_pipe.predict_proba(X_test)[:, 1])
    valid_roc = roc_auc_score(y_valid, clf_pipe.predict_proba(X_valid)[:, 1])
    results.loc[len(results)] = ["Baseline Logistic Regression","HCDR",f"{train_acc*100:8.2f}%",
        f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",f"{np.round(train_roc,4)}",f"{np.round(test_roc,4)}",f"{np.round(valid_roc,4)}"]
    display(results)
    return clf_pipe

logreg = model_logreg(X_train)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.624501 | -0.578880 | -0.528796 | 1.188441 | 0.175508 | 0.976570 | -1.001111 | -1.110049 | -0.457711 | 0.275696 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.560257 | 0.810033 | 0.749824 | 1.304588 | 1.218472 | 1.584835 | 0.401388 | 0.575316 | -0.471672 | -0.681559 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -0.811727 | 0.810033 | 0.110514 | -0.452448 | -0.310068 | -0.483266 | 1.836900 | 0.340053 | -0.455801 | 1.250546 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.776179 | -0.578880 | 0.749824 | 1.188441 | 0.529898 | 0.976570 | 3.740104 | 0.272179 | -0.464324 | -1.133367 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.911520 | 0.810033 | -0.315693 | 0.559617 | -0.197845 | 0.404801 | -0.928785 | -0.585637 | -0.522481 | 0.798454 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 rows × 245 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.745
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression | HCDR | 91.95% | 91.77% | 91.93% | 0.7491 | 0.7451 | 0.7407 |
For both regression and classification trees, the cost function favors splits that produce the most homogeneous branches, that is, branches whose groups have similar responses.
Regression : sum(y - prediction)²
Classification : G = sum(pk * (1 - pk))
A Gini score gives an idea of how good a split is by how mixed the response classes are in the groups created by the split. Here, pk is the proportion of inputs of class k present in a particular group.
The criterion is the function used to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Information gain uses entropy as the impurity measure and splits a node so that it yields the largest information gain. Gini impurity measures how often a randomly chosen sample from the node would be mislabeled if it were labeled according to the node's class distribution, and splits a node so that it yields the least impurity.
Gini : $\Large 1 - \sum^m_{i=1} P_i^2$
Entropy : $\Large -\sum^m_{i=1} P_i \log\left(P_i\right)$
Here $P_i$ is the proportion of class $i$ in the node and $m$ is the number of classes.
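The two impurity measures above can be checked on small label lists. The sketch below is illustrative only: the function names are hypothetical, and it assumes base-2 logarithms for entropy, so a perfectly mixed binary node has entropy 1.

```python
import numpy as np

def class_proportions(labels):
    """Proportion of each distinct class among the labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum_i P_i^2."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_i P_i * log2(P_i)."""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

pure = [0, 0, 0, 0]    # one class only
mixed = [0, 0, 1, 1]   # evenly mixed

print("gini:", gini(pure), gini(mixed))
print("entropy:", entropy(pure), entropy(mixed))
print("perfect split:", weighted_impurity([0, 0, 0], [1, 1, 1]))
print("useless split:", weighted_impurity([0, 1], [0, 1]))
```

A split that separates the classes perfectly has weighted impurity 0, while a split that leaves both children evenly mixed is no better than the parent node.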
To calculate the importance of each feature, we consider each decision node together with its child nodes. The following formulas cover the calculation of feature importance.
For each decision tree, Scikit-learn calculates a node's importance using Gini importance, assuming only two child nodes (binary tree):
$\Large ni_j = w_jC_j - w_{left(j)}C_{left(j)} - w_{right(j)}C_{right(j)}$
Where
$ni_j$ = the importance of node $j$
$w_j$ = weighted number of samples reaching node $j$
$C_j$ = the impurity value of node $j$
$left(j)$ = child node from left split on node $j$
$right(j)$ = child node from right split on node $j$
The importance for each feature on a decision tree is then calculated as:
$\Large fi_i = \frac{\sum_{j \,:\, \text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \in \text{all nodes}} ni_k}$
$fi_i$ is the feature importance of the $i^{th}$ feature
These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature importance values:
$\Large normfi_i = \frac{fi_i}{\sum_{j \in \text{all features}} fi_j}$
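The formulas above can be traced by hand on a tiny tree. The sketch below uses a hypothetical five-node tree with made-up class counts and features named "A" and "B"; sample counts stand in for the weights $w_j$, and leaf nodes contribute no importance. It is an illustration of the formulas, not scikit-learn's actual implementation.

```python
from collections import defaultdict

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Hypothetical tree: node id -> (split feature or None for a leaf,
#                                class counts, left child, right child)
tree = {
    0: ("A", (50, 50), 1, 2),
    1: ("B", (45, 15), 3, 4),
    2: (None, (5, 35), None, None),
    3: (None, (30, 0), None, None),
    4: (None, (15, 15), None, None),
}

def weighted_impurity(tree, k):
    """w_k * C_k, using the node's sample count as the weight."""
    counts = tree[k][1]
    return sum(counts) * gini(counts)

def node_importance(tree, j):
    """ni_j = w_j*C_j - w_left*C_left - w_right*C_right (0 for leaves)."""
    feature, _, left, right = tree[j]
    if feature is None:
        return 0.0
    return (weighted_impurity(tree, j)
            - weighted_impurity(tree, left)
            - weighted_impurity(tree, right))

def feature_importances(tree):
    """fi_i per feature, then normalized so the values sum to 1."""
    ni = {j: node_importance(tree, j) for j in tree}
    total = sum(ni.values())
    fi = defaultdict(float)
    for j, node in tree.items():
        if node[0] is not None:
            fi[node[0]] += ni[j] / total
    norm = sum(fi.values())
    return {feature: value / norm for feature, value in fi.items()}

importances = feature_importances(tree)
print(importances)
```

Here the root split on "A" removes more weighted impurity than the later split on "B", so "A" receives the larger share of the importance.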
criterion {“gini”, “entropy”}, default=“gini” : The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth int, default=None : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_leaf int or float, default=1 : The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
def model_dt(X_train):
    tree = DecisionTreeClassifier(criterion='gini', random_state = 42)
    data_pipeline_dt = make_pipeline(data_pipeline, tree)
    data_pipeline_dt.fit(X_train, y_train)
    train_acc = data_pipeline_dt.score(X_train, y_train)
    validAcc = data_pipeline_dt.score(X_valid, y_valid)
    testAcc = data_pipeline_dt.score(X_test, y_test)
    predictions = data_pipeline_dt.predict_proba(X_test)
    print ("Score",roc_auc_score(y_test, predictions[:,1]))
    fpr, tpr, _ = roc_curve(y_test, predictions[:,1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('ROC curve')
    plt.show()
    # The tree is trained on the transformed matrix, so its importances are indexed by
    # the scaled numerical columns followed by the one-hot columns, not by X_train.columns
    fitted_transformer = data_pipeline_dt.steps[0][1]
    features = list(numerical_features) + \
        list(fitted_transformer.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
    importances = tree.feature_importances_
    indices = np.argsort(importances)  # ascending order
    top_feature = indices[-1]          # the last index is the most important feature
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='b', align='center')
    #plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.grid()
    plt.show()
    print(f"The feature having highest importance is {features[top_feature]}")
    train_roc = roc_auc_score(y_train, data_pipeline_dt.predict_proba(X_train)[:, 1])
    test_roc = roc_auc_score(y_test, data_pipeline_dt.predict_proba(X_test)[:, 1])
    valid_roc = roc_auc_score(y_valid, data_pipeline_dt.predict_proba(X_valid)[:, 1])
    results.loc[len(results)] = ["Baseline Decision Tree","HCDR",f"{train_acc*100:8.2f}%",
        f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",f"{np.round(train_roc,4)}",f"{np.round(test_roc,4)}",f"{np.round(valid_roc,4)}"]
    display(results)

model_dt(X_train)
Score 0.5388834038729503
The feature having highest importance is NAME_EDUCATION_TYPE
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression | HCDR | 91.95% | 91.77% | 91.93% | 0.7491 | 0.7451 | 0.7407 |
| 1 | Baseline Decision Tree | HCDR | 100.00% | 85.08% | 85.46% | 1.0 | 0.5389 | 0.5333 |
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
n_estimators int, default=100 : The number of trees in the forest.
max_depth int, default=None : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
max_features : The number of features to consider when looking for the best split.
min_impurity_decrease float, default=0.0 : The minimum impurity decrease required to split a node. A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
bootstrap bool, default=True : Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.
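The averaging across trees described above can be verified on synthetic data. This is a minimal sketch, assuming a toy dataset from `make_classification` rather than the HCDR data; the variable names are hypothetical. Within each tree the nodes are weighted by their sample counts, and the forest then averages the per-tree importances and rescales them to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for the transformed training matrix
X_toy, y_toy = make_classification(n_samples=500, n_features=6,
                                   n_informative=3, random_state=42)

forest = RandomForestClassifier(n_estimators=20, max_depth=6,
                                random_state=42).fit(X_toy, y_toy)

# One row of importances per fitted tree in the forest
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

# Average over trees, then rescale so the importances sum to 1
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print("matches forest.feature_importances_:",
      np.allclose(averaged, forest.feature_importances_))
print("most important column index:", int(np.argmax(averaged)))
```

The spread of `per_tree` along axis 0 also gives a rough idea of how stable each feature's importance is across the bootstrap samples.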
def model_rf(X_train):
    RF = RandomForestClassifier(random_state = 42,n_estimators=20, criterion='gini', max_depth=6)
    data_pipeline_rf = make_pipeline(data_pipeline, RF)
    data_pipeline_rf.fit(X_train, y_train)
    train_acc = data_pipeline_rf.score(X_train, y_train)
    validAcc = data_pipeline_rf.score(X_valid, y_valid)
    testAcc = data_pipeline_rf.score(X_test, y_test)
    predictions = data_pipeline_rf.predict_proba(X_test)
    print ("Score",roc_auc_score(y_test, predictions[:,1]))
    fpr, tpr, _ = roc_curve(y_test, predictions[:,1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('ROC curve')
    plt.show()
    train_roc = roc_auc_score(y_train, data_pipeline_rf.predict_proba(X_train)[:, 1])
    test_roc = roc_auc_score(y_test, data_pipeline_rf.predict_proba(X_test)[:, 1])
    valid_roc = roc_auc_score(y_valid, data_pipeline_rf.predict_proba(X_valid)[:, 1])
    results.loc[len(results)] = ["Baseline Random Forest","HCDR",f"{train_acc*100:8.2f}%",
        f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",f"{np.round(train_roc,4)}",f"{np.round(test_roc,4)}",f"{np.round(valid_roc,4)}"]
    display(results)
    return data_pipeline_rf,RF

rf_pipe, RF = model_rf(X_train)
Score 0.7202735379743133
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression | HCDR | 91.95% | 91.77% | 91.93% | 0.7491 | 0.7451 | 0.7407 |
| 1 | Baseline Decision Tree | HCDR | 100.00% | 85.08% | 85.46% | 1.0 | 0.5389 | 0.5333 |
| 2 | Baseline Random Forest | HCDR | 91.95% | 91.78% | 91.95% | 0.7355 | 0.7203 | 0.7146 |
Feature Importances of Random Forest Classifier
# RF was trained on the transformed matrix, so its importances are indexed by the
# transformed (scaled + one-hot) columns rather than by X_train.columns
importances = RF.feature_importances_
# ascending order, so the most important features are drawn at the top of the chart
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.xlabel('Relative Importance')
plt.ylabel('Input Features')
plt.grid()
plt.show()
bureau_data = datasets['bureau']
bureau_data
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.00 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.00 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.50 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.00 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.00 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1716423 | 259355 | 5057750 | Active | currency 1 | -44 | 0 | -30.0 | NaN | 0.0 | 0 | 11250.00 | 11250.0 | 0.0 | 0.0 | Microloan | -19 | NaN |
| 1716424 | 100044 | 5057754 | Closed | currency 1 | -2648 | 0 | -2433.0 | -2493.0 | 5476.5 | 0 | 38130.84 | 0.0 | 0.0 | 0.0 | Consumer credit | -2493 | NaN |
| 1716425 | 100044 | 5057762 | Closed | currency 1 | -1809 | 0 | -1628.0 | -970.0 | NaN | 0 | 15570.00 | NaN | NaN | 0.0 | Consumer credit | -967 | NaN |
| 1716426 | 246829 | 5057770 | Closed | currency 1 | -1878 | 0 | -1513.0 | -1513.0 | NaN | 0 | 36000.00 | 0.0 | 0.0 | 0.0 | Consumer credit | -1508 | NaN |
| 1716427 | 246829 | 5057778 | Closed | currency 1 | -463 | 0 | NaN | -387.0 | NaN | 0 | 22500.00 | 0.0 | NaN | 0.0 | Microloan | -387 | NaN |
1716428 rows × 17 columns
bureau_data['DAYS_CREDIT'].value_counts()
-364 1330
-336 1248
-273 1238
-357 1218
-343 1203
...
-4 113
-3 74
-2 42
0 25
-1 17
Name: DAYS_CREDIT, Length: 2923, dtype: int64
msno.bar(bureau_data)
plt.show()
corr = bureau_data.corr()['DAYS_CREDIT'].sort_values(ascending = False)
print('Most Positive Correlations:\n', corr.head(10))
print('\nMost Negative Correlations:\n', corr.tail(10))
Most Positive Correlations:
 DAYS_CREDIT             1.000000
DAYS_ENDDATE_FACT       0.875359
DAYS_CREDIT_UPDATE      0.688771
DAYS_CREDIT_ENDDATE     0.225682
AMT_CREDIT_SUM_DEBT     0.135397
AMT_CREDIT_SUM          0.050883
AMT_CREDIT_SUM_LIMIT    0.025140
SK_ID_BUREAU            0.013015
AMT_ANNUITY             0.005676
SK_ID_CURR              0.000266
Name: DAYS_CREDIT, dtype: float64

Most Negative Correlations:
 AMT_CREDIT_SUM_DEBT       0.135397
AMT_CREDIT_SUM            0.050883
AMT_CREDIT_SUM_LIMIT      0.025140
SK_ID_BUREAU              0.013015
AMT_ANNUITY               0.005676
SK_ID_CURR                0.000266
AMT_CREDIT_SUM_OVERDUE   -0.000383
AMT_CREDIT_MAX_OVERDUE   -0.014724
CREDIT_DAY_OVERDUE       -0.027266
CNT_CREDIT_PROLONG       -0.030460
Name: DAYS_CREDIT, dtype: float64
bureau_corr = pd.DataFrame(corr, columns = ['DAYS_CREDIT'])
bureau_corr
| | DAYS_CREDIT |
|---|---|
| DAYS_CREDIT | 1.000000 |
| DAYS_ENDDATE_FACT | 0.875359 |
| DAYS_CREDIT_UPDATE | 0.688771 |
| DAYS_CREDIT_ENDDATE | 0.225682 |
| AMT_CREDIT_SUM_DEBT | 0.135397 |
| AMT_CREDIT_SUM | 0.050883 |
| AMT_CREDIT_SUM_LIMIT | 0.025140 |
| SK_ID_BUREAU | 0.013015 |
| AMT_ANNUITY | 0.005676 |
| SK_ID_CURR | 0.000266 |
| AMT_CREDIT_SUM_OVERDUE | -0.000383 |
| AMT_CREDIT_MAX_OVERDUE | -0.014724 |
| CREDIT_DAY_OVERDUE | -0.027266 |
| CNT_CREDIT_PROLONG | -0.030460 |
bureau_corr= bureau_corr[:20].dropna()
sns.heatmap(bureau_corr, annot = True ,fmt='.2f')
plt.show()
bureau_data['CREDIT_TYPE'].value_counts()
Consumer credit                                 1251615
Credit card                                      402195
Car loan                                          27690
Mortgage                                          18391
Microloan                                         12413
Loan for business development                      1975
Another type of loan                               1017
Unknown type of loan                                555
Loan for working capital replenishment              469
Cash loan (non-earmarked)                            56
Real estate loan                                     27
Loan for the purchase of equipment                   19
Loan for purchase of shares (margin lending)          4
Mobile operator loan                                  1
Interbank credit                                      1
Name: CREDIT_TYPE, dtype: int64
bureau_data['CREDIT_ACTIVE'].value_counts()
Closed      1079273
Active       630607
Sold           6527
Bad debt         21
Name: CREDIT_ACTIVE, dtype: int64
# Flag every credit that is not 'Closed' (Active, Sold, Bad debt) as 1 and closed credits as 0.
# The vectorized comparison gives the same result as a row-wise apply, but is much faster
# on the 1.7 million bureau rows.
bureau_data['CREDIT_ACTIVE_CLASSIFY'] = (bureau_data['CREDIT_ACTIVE'] != 'Closed').astype(int)
bureau_data['CREDIT_ACTIVE_CLASSIFY'].value_counts()
0    1079273
1     637155
Name: CREDIT_ACTIVE_CLASSIFY, dtype: int64
active_loan = bureau_data.groupby(by = ['SK_ID_CURR'])['CREDIT_ACTIVE_CLASSIFY'].mean().reset_index().rename(index=str, columns={'CREDIT_ACTIVE_CLASSIFY': 'ACTIVE_LOANS_PERCENTAGE'})
active_loan
| | SK_ID_CURR | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|
| 0 | 100001 | 0.428571 |
| 1 | 100002 | 0.250000 |
| 2 | 100003 | 0.250000 |
| 3 | 100004 | 0.000000 |
| 4 | 100005 | 0.666667 |
| ... | ... | ... |
| 305806 | 456249 | 0.153846 |
| 305807 | 456250 | 0.666667 |
| 305808 | 456253 | 0.500000 |
| 305809 | 456254 | 0.000000 |
| 305810 | 456255 | 0.454545 |
305811 rows × 2 columns
bureau_data = bureau_data.merge(active_loan, on = ['SK_ID_CURR'], how = 'left')
bureau_data
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | CREDIT_ACTIVE_CLASSIFY | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.00 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN | 0 | 0.545455 |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.00 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN | 1 | 0.545455 |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.50 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN | 1 | 0.545455 |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.00 | NaN | NaN | 0.0 | Credit card | -16 | NaN | 1 | 0.545455 |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.00 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN | 1 | 0.545455 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1716423 | 259355 | 5057750 | Active | currency 1 | -44 | 0 | -30.0 | NaN | 0.0 | 0 | 11250.00 | 11250.0 | 0.0 | 0.0 | Microloan | -19 | NaN | 1 | 1.000000 |
| 1716424 | 100044 | 5057754 | Closed | currency 1 | -2648 | 0 | -2433.0 | -2493.0 | 5476.5 | 0 | 38130.84 | 0.0 | 0.0 | 0.0 | Consumer credit | -2493 | NaN | 0 | 0.545455 |
| 1716425 | 100044 | 5057762 | Closed | currency 1 | -1809 | 0 | -1628.0 | -970.0 | NaN | 0 | 15570.00 | NaN | NaN | 0.0 | Consumer credit | -967 | NaN | 0 | 0.545455 |
| 1716426 | 246829 | 5057770 | Closed | currency 1 | -1878 | 0 | -1513.0 | -1513.0 | NaN | 0 | 36000.00 | 0.0 | 0.0 | 0.0 | Consumer credit | -1508 | NaN | 0 | 0.258065 |
| 1716427 | 246829 | 5057778 | Closed | currency 1 | -463 | 0 | NaN | -387.0 | NaN | 0 | 22500.00 | 0.0 | NaN | 0.0 | Microloan | -387 | NaN | 0 | 0.258065 |
1716428 rows × 19 columns
bureau_data.drop(['AMT_ANNUITY','DAYS_ENDDATE_FACT','CREDIT_CURRENCY','CREDIT_ACTIVE_CLASSIFY'], axis = 1, inplace = True)
bureau_data
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | -497 | 0 | -153.0 | NaN | 0 | 91323.00 | 0.0 | NaN | 0.0 | Consumer credit | -131 | 0.545455 |
| 1 | 215354 | 5714463 | Active | -208 | 0 | 1075.0 | NaN | 0 | 225000.00 | 171342.0 | NaN | 0.0 | Credit card | -20 | 0.545455 |
| 2 | 215354 | 5714464 | Active | -203 | 0 | 528.0 | NaN | 0 | 464323.50 | NaN | NaN | 0.0 | Consumer credit | -16 | 0.545455 |
| 3 | 215354 | 5714465 | Active | -203 | 0 | NaN | NaN | 0 | 90000.00 | NaN | NaN | 0.0 | Credit card | -16 | 0.545455 |
| 4 | 215354 | 5714466 | Active | -629 | 0 | 1197.0 | 77674.5 | 0 | 2700000.00 | NaN | NaN | 0.0 | Consumer credit | -21 | 0.545455 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1716423 | 259355 | 5057750 | Active | -44 | 0 | -30.0 | 0.0 | 0 | 11250.00 | 11250.0 | 0.0 | 0.0 | Microloan | -19 | 1.000000 |
| 1716424 | 100044 | 5057754 | Closed | -2648 | 0 | -2433.0 | 5476.5 | 0 | 38130.84 | 0.0 | 0.0 | 0.0 | Consumer credit | -2493 | 0.545455 |
| 1716425 | 100044 | 5057762 | Closed | -1809 | 0 | -1628.0 | NaN | 0 | 15570.00 | NaN | NaN | 0.0 | Consumer credit | -967 | 0.545455 |
| 1716426 | 246829 | 5057770 | Closed | -1878 | 0 | -1513.0 | NaN | 0 | 36000.00 | 0.0 | 0.0 | 0.0 | Consumer credit | -1508 | 0.258065 |
| 1716427 | 246829 | 5057778 | Closed | -463 | 0 | NaN | NaN | 0 | 22500.00 | 0.0 | NaN | 0.0 | Microloan | -387 | 0.258065 |
1716428 rows × 15 columns
bureau_data['CNT_CREDIT_PROLONG'].value_counts()
0    1707314
1       7620
2       1222
3        191
4         54
5         21
9          2
6          2
8          1
7          1
Name: CNT_CREDIT_PROLONG, dtype: int64
bureau_data['AMT_CREDIT_MAX_OVERDUE'].fillna(0, inplace = True)
bureau_data['AMT_CREDIT_SUM_LIMIT'].isnull().value_counts()
False    1124648
True      591780
Name: AMT_CREDIT_SUM_LIMIT, dtype: int64
bureau_data['AMT_CREDIT_SUM'].fillna(0, inplace = True)
bureau_data['AMT_CREDIT_SUM_LIMIT'].fillna(0, inplace = True)
bureau_data[['AMT_CREDIT_SUM','AMT_CREDIT_SUM_DEBT','AMT_CREDIT_SUM_LIMIT']]
| | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT |
|---|---|---|---|
| 0 | 91323.00 | 0.0 | 0.0 |
| 1 | 225000.00 | 171342.0 | 0.0 |
| 2 | 464323.50 | NaN | 0.0 |
| 3 | 90000.00 | NaN | 0.0 |
| 4 | 2700000.00 | NaN | 0.0 |
| ... | ... | ... | ... |
| 1716423 | 11250.00 | 11250.0 | 0.0 |
| 1716424 | 38130.84 | 0.0 | 0.0 |
| 1716425 | 15570.00 | NaN | 0.0 |
| 1716426 | 36000.00 | 0.0 | 0.0 |
| 1716427 | 22500.00 | 0.0 | 0.0 |
1716428 rows × 3 columns
bureau_data['AMT_CREDIT_SUM_DEBT'].fillna(bureau_data['AMT_CREDIT_SUM'], inplace = True)
difference = bureau_data['AMT_CREDIT_SUM'] - bureau_data['AMT_CREDIT_SUM_DEBT']
bureau_data['AMT_CREDIT_SUM_LIMIT'].fillna(difference, inplace = True)
bureau_data.drop('DAYS_CREDIT_ENDDATE', axis = 1, inplace = True)
msno.bar(bureau_data)
plt.show()
results = pd.DataFrame(columns = ["Pipeline",
"Dataset",
"TrainAcc",
"ValidAcc",
"TestAcc",
"TrainROC",
"TestROC",
"ValidROC",
"Feature Added"])
def model_logreg(train, feature):
    # Fit a logistic-regression pipeline on `train` and append its scores to `results`
    X = train.drop(['TARGET'], axis = 1)
    y = train["TARGET"]
    # Split the provided training data into training, validation, and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    print(f"X train shape: {X_train.shape}")
    print(f"X validation shape: {X_valid.shape}")
    print(f"X test shape: {X_test.shape}")
    numerical_features = X_train.select_dtypes(include = ['int64', 'float64']).columns
    categorical_features = X_train.select_dtypes(include = ['object', 'bool']).columns
    #print(f"\nNumerical features : {list(numerical_features)}")
    #print(f"\nCategorical features : {list(categorical_features)}")
    # Scaling runs before imputation here; StandardScaler ignores NaNs when
    # fitting and leaves them in place for the imputer to fill
    num_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('imputer', SimpleImputer(strategy = 'median'))
    ])
    cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
    data_pipeline = ColumnTransformer([
        ("num_pipeline", num_pipeline, numerical_features),
        ("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)
    X_train_transformed = data_pipeline.fit_transform(X_train)
    # Newer scikit-learn releases expose this method as get_feature_names_out
    column_names = list(numerical_features) + \
        list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
    X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=column_names)
    display(X_train_transformed_df.head())
    clf_pipe = make_pipeline(data_pipeline, LogisticRegression())
    clf_pipe.fit(X_train, y_train)
    train_acc = clf_pipe.score(X_train, y_train)
    validAcc = clf_pipe.score(X_valid, y_valid)
    testAcc = clf_pipe.score(X_test, y_test)
    ## Plotting AUC / ROC
    # "No skill" baseline: the same score for every test row
    ns_probs = [0 for _ in range(len(y_test))]
    lr_probs = clf_pipe.predict_proba(X_test)
    # keep probabilities for the positive outcome only
    lr_probs = lr_probs[:, 1]
    # calculate scores
    ns_auc = roc_auc_score(y_test, ns_probs)
    lr_auc = roc_auc_score(y_test, lr_probs)
    print('No Skill: ROC AUC=%.3f' % (ns_auc))
    print('Logistic: ROC AUC=%.3f' % (lr_auc))
    # calculate roc curves
    ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
    lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
    # plot the roc curve for the model
    plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
    plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
    # axis labels
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    # show the legend
    plt.legend()
    # show the plot
    plt.show()
    train_roc = roc_auc_score(y_train, clf_pipe.predict_proba(X_train)[:, 1])
    test_roc = roc_auc_score(y_test, clf_pipe.predict_proba(X_test)[:, 1])
    valid_roc = roc_auc_score(y_valid, clf_pipe.predict_proba(X_valid)[:, 1])
    # Append one row of scores; the order matches the columns of `results`
    results.loc[len(results)] = ["Logistic Regression","HCDR",f"{train_acc*100:8.2f}%",
        f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",f"{np.round(train_roc,4)}",f"{np.round(test_roc,4)}",f"{np.round(valid_roc,4)}",f"{feature}"]
    display(results)
    return clf_pipe
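The two chained `train_test_split` calls in `model_logreg` hold out 20% of the rows for testing and then 20% of the remainder for validation, which works out to roughly a 64/16/20 split. A minimal sketch on synthetic data, where `X_demo`, `y_demo`, and the row count are illustrative assumptions rather than project data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 3 features, binary labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

# First split: hold out 20% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)
# Second split: hold out 20% of the remaining 80% (16% overall) for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=42)

split_sizes = (len(X_train), len(X_valid), len(X_test))
print(split_sizes)
```

The same arithmetic applied to a 1,509,345-row table gives 965,980 / 241,496 / 301,869 rows, which is what the shapes printed by the model runs in this section show.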
bureau_active_loans = pd.DataFrame(bureau_data[['SK_ID_CURR','ACTIVE_LOANS_PERCENTAGE']], columns = ['SK_ID_CURR','ACTIVE_LOANS_PERCENTAGE'])
bureau_active_loans
| | SK_ID_CURR | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|
| 0 | 215354 | 0.545455 |
| 1 | 215354 | 0.545455 |
| 2 | 215354 | 0.545455 |
| 3 | 215354 | 0.545455 |
| 4 | 215354 | 0.545455 |
| ... | ... | ... |
| 1716423 | 259355 | 1.000000 |
| 1716424 | 100044 | 0.545455 |
| 1716425 | 100044 | 0.545455 |
| 1716426 | 246829 | 0.258065 |
| 1716427 | 246829 | 0.258065 |
1716428 rows × 2 columns
train = datasets['application_train']
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 117 columns
train.drop(['DAYS_LAST_PHONE_CHANGE','OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'], axis = 1, inplace = True)
train = train.merge(bureau_active_loans, on = 'SK_ID_CURR', how = 'left')
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
1509345 rows × 118 columns
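The row count grows from 307511 to 1509345 because `bureau_active_loans` holds one row per bureau record, so the left merge repeats each applicant once per matching record. One possible way to keep a single row per applicant is to collapse the bureau table before merging. The sketch below assumes `ACTIVE_LOANS_PERCENTAGE` is constant within each `SK_ID_CURR` (as the displayed rows suggest); the toy frames `train_demo` and `bureau_demo` are hypothetical stand-ins, not the project data.

```python
import pandas as pd

# Hypothetical per-applicant table (one row per SK_ID_CURR).
train_demo = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004],
                           "TARGET": [1, 0, 0]})
# Hypothetical per-record table: the same applicant appears several times.
bureau_demo = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100002, 100003, 100003],
    "ACTIVE_LOANS_PERCENTAGE": [0.25, 0.25, 0.25, 0.5, 0.5],
})

# Merging directly repeats each applicant once per bureau record.
naive = train_demo.merge(bureau_demo, on="SK_ID_CURR", how="left")

# Collapsing to one row per applicant first keeps the original row count.
per_client = bureau_demo.drop_duplicates(subset="SK_ID_CURR")
deduped = train_demo.merge(per_client, on="SK_ID_CURR", how="left")

print(len(train_demo), len(naive), len(deduped))
```

Applicants without any bureau record (here `100004`) still get a row with a missing `ACTIVE_LOANS_PERCENTAGE`, which the imputer in the pipeline would then fill.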
train = train.loc[:, ~train.columns.str.startswith("FLAG_DOCUMENT_")]
train = train.loc[:, ~train.columns.str.endswith("MODE")]
train = train.loc[:, ~train.columns.str.endswith("MEDI")]
train = train.loc[:, ~train.columns.str.endswith("AVG")]
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.083037 | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.083037 | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.083037 | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.083037 | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.083037 | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.734460 | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.734460 | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.734460 | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.734460 | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.734460 | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 |
1509345 rows × 51 columns
active_loan_model = model_logreg(train, 'Active loan Percentage feature')
X train shape: (965980, 97)
X validation shape: (241496, 97)
X test shape: (301869, 97)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.714681 | -0.585812 | 0.000407 | -0.970848 | -0.215580 | -0.892008 | 1.161776 | 1.429968 | -0.449881 | -0.017378 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -1.062862 | -0.585812 | -0.191312 | 2.262115 | 1.247094 | 2.101707 | 0.824933 | 0.343550 | -0.447546 | 0.702082 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.282842 | -0.585812 | 0.057922 | -0.187606 | -0.483416 | -0.293265 | -0.835134 | -0.220450 | -0.514405 | 1.150532 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.163686 | -0.585812 | -0.287171 | 1.231743 | 0.630882 | 1.502964 | -0.165296 | -1.529123 | 2.165432 | -0.029074 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -0.411456 | -0.585812 | 1.150719 | 3.407022 | 2.134300 | 3.299193 | -0.782452 | 0.279264 | -0.450595 | 1.299730 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 220 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.751
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.752 | 0.7505 | 0.7496 | Active loan Percentage feature |
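In the merged table each `SK_ID_CURR` appears on several identical rows, so a row-wise random split can place copies of the same applicant in both the training and the held-out sets, which may make the held-out scores optimistic. A hedged sketch of a group-aware alternative using scikit-learn's `GroupShuffleSplit`; the array `client_ids` is a hypothetical stand-in for the `SK_ID_CURR` column, and this is not the split used to produce the results above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical applicant IDs with repeated rows, as after a fan-out merge.
client_ids = np.array([1, 1, 1, 2, 2, 3, 4, 4, 5, 5])
X_demo = np.arange(len(client_ids)).reshape(-1, 1)

# Split by applicant rather than by row: all rows of an ID stay together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, groups=client_ids))

# No applicant should appear on both sides of the split.
shared = set(client_ids[train_idx]) & set(client_ids[test_idx])
print(sorted(shared))
```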
amt_income = train['AMT_INCOME_TOTAL']
amt_credit = train['AMT_CREDIT']
# Note: despite the column name, this is income divided by credit
credit_income_perc = amt_income/amt_credit
train['CREDIT_INCOME_RATIO'] = credit_income_perc
train['CREDIT_INCOME_RATIO']
0 0.498036
1 0.498036
2 0.498036
3 0.498036
4 0.498036
...
1509340 0.233333
1509341 0.233333
1509342 0.233333
1509343 0.233333
1509344 0.233333
Name: CREDIT_INCOME_RATIO, Length: 1509345, dtype: float64
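The values printed above correspond to `AMT_INCOME_TOTAL / AMT_CREDIT`. As a quick arithmetic check, the sketch below recomputes the ratio for the first and last applicants shown in the table (the two-row frame `demo` is built by hand from those displayed values):

```python
import pandas as pd

# Income and credit amounts copied from the first and last rows displayed above.
demo = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 157500.0],
    "AMT_CREDIT": [406597.5, 675000.0],
})

# Income divided by credit, matching how CREDIT_INCOME_RATIO is computed.
ratio = demo["AMT_INCOME_TOTAL"] / demo["AMT_CREDIT"]
print(ratio.round(6).tolist())
```

A value below 1 means the requested credit exceeds the applicant's stated income.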
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | EXT_SOURCE_2 | EXT_SOURCE_3 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 |
1509345 rows × 52 columns
credit_income_model = model_logreg(train, 'Credit Income Ratio feature')
X train shape: (965980, 98)
X validation shape: (241496, 98)
X test shape: (301869, 98)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.714681 | -0.585812 | 0.000407 | -0.970848 | -0.215580 | -0.892008 | 1.161776 | 1.429968 | -0.449881 | -0.017378 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -1.062862 | -0.585812 | -0.191312 | 2.262115 | 1.247094 | 2.101707 | 0.824933 | 0.343550 | -0.447546 | 0.702082 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.282842 | -0.585812 | 0.057922 | -0.187606 | -0.483416 | -0.293265 | -0.835134 | -0.220450 | -0.514405 | 1.150532 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.163686 | -0.585812 | -0.287171 | 1.231743 | 0.630882 | 1.502964 | -0.165296 | -1.529123 | 2.165432 | -0.029074 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -0.411456 | -0.585812 | 1.150719 | 3.407022 | 2.134300 | 3.299193 | -0.782452 | 0.279264 | -0.450595 | 1.299730 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 221 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.751
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.752 | 0.7505 | 0.7496 | Active loan Percentage feature |
| 1 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7526 | 0.7511 | 0.7502 | Credit Income Ratio feature |
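Accuracy is identical across the rows of the results table while ROC AUC moves only in the fourth decimal place. One possible explanation, assuming `TARGET` is heavily imbalanced (an assumption; the class balance is not shown in this section), is that accuracy is dominated by the majority class. The sketch below computes the accuracy of always predicting the majority class on synthetic labels; the 8% positive rate is hypothetical.

```python
import numpy as np

# Synthetic labels with a hypothetical ~8% positive rate.
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.08).astype(int)

# Accuracy of a classifier that always predicts the most frequent class.
majority_label = int(np.bincount(y_demo).argmax())
baseline_acc = float((y_demo == majority_label).mean())
print(majority_label, round(baseline_acc, 4))
```

Comparing each model's accuracy against such a baseline on the real `TARGET` column would show how much of the reported accuracy comes from the class balance alone; ROC AUC is less affected by this.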
How many years would the borrower need to repay the credit amount requested in the application? We estimate this as AMT_CREDIT divided by AMT_ANNUITY, rounded to the nearest whole number:
train['YEARS_TO_PAY'] = (train['AMT_CREDIT']/train['AMT_ANNUITY']).round()
train[['YEARS_TO_PAY']]
| | YEARS_TO_PAY |
|---|---|
| 0 | 16.0 |
| 1 | 16.0 |
| 2 | 16.0 |
| 3 | 16.0 |
| 4 | 16.0 |
| ... | ... |
| 1509340 | 14.0 |
| 1509341 | 14.0 |
| 1509342 | 14.0 |
| 1509343 | 14.0 |
| 1509344 | 14.0 |
1509345 rows × 1 columns
np.isinf(train[['YEARS_TO_PAY']]).values.sum()
0
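The `np.isinf` check above guards against a zero annuity, which would turn the division into an infinite value. A small sketch of both the arithmetic and the guard: the first two rows reuse the credit and annuity amounts displayed earlier, and the third row with a zero annuity is a hypothetical case added for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 675000.0, 100000.0],
    "AMT_ANNUITY": [24700.5, 49117.5, 0.0],  # last row: hypothetical zero annuity
})

# Credit divided by annuity, rounded to the nearest whole number.
years = (demo["AMT_CREDIT"] / demo["AMT_ANNUITY"]).round()

# Count infinite values, then map them to NaN so an imputer can handle them.
n_inf = int(np.isinf(years).sum())
years_clean = years.replace([np.inf, -np.inf], np.nan)
print(years.tolist(), n_inf)
```

Note that `np.isinf` does not count missing values, so a result of 0 says nothing about NaNs produced by a missing `AMT_ANNUITY`.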
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | EXT_SOURCE_3 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 |
1509345 rows × 53 columns
years_to_pay_model = model_logreg(train, 'Credit-Annuity Ratio of Current Application Feature')
X train shape: (965980, 99)
X validation shape: (241496, 99)
X test shape: (301869, 99)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.714681 | -0.585812 | 0.000407 | -0.970848 | -0.215580 | -0.892008 | 1.161776 | 1.429968 | -0.449881 | -0.017378 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -1.062862 | -0.585812 | -0.191312 | 2.262115 | 1.247094 | 2.101707 | 0.824933 | 0.343550 | -0.447546 | 0.702082 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.282842 | -0.585812 | 0.057922 | -0.187606 | -0.483416 | -0.293265 | -0.835134 | -0.220450 | -0.514405 | 1.150532 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.163686 | -0.585812 | -0.287171 | 1.231743 | 0.630882 | 1.502964 | -0.165296 | -1.529123 | 2.165432 | -0.029074 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -0.411456 | -0.585812 | 1.150719 | 3.407022 | 2.134300 | 3.299193 | -0.782452 | 0.279264 | -0.450595 | 1.299730 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 222 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.751
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.752 | 0.7505 | 0.7496 | Active loan Percentage feature |
| 1 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7526 | 0.7511 | 0.7502 | Credit Income Ratio feature |
| 2 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7525 | 0.7511 | 0.7501 | Credit-Annuity Ratio of Current Application Fe... |
train['INCOME_ANNUITY'] = (train['AMT_INCOME_TOTAL']/train['AMT_ANNUITY']).round()
train[['INCOME_ANNUITY']]
| | INCOME_ANNUITY |
|---|---|
| 0 | 8.0 |
| 1 | 8.0 |
| 2 | 8.0 |
| 3 | 8.0 |
| 4 | 8.0 |
| ... | ... |
| 1509340 | 3.0 |
| 1509341 | 3.0 |
| 1509342 | 3.0 |
| 1509343 | 3.0 |
| 1509344 | 3.0 |
1509345 rows × 1 columns
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
1509345 rows × 54 columns
income_annuity_model = model_logreg(train, 'Income-Annuity Ratio of Current Application Feature')
X train shape: (965980, 100)
X validation shape: (241496, 100)
X test shape: (301869, 100)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.714681 | -0.585812 | 0.000407 | -0.970848 | -0.215580 | -0.892008 | 1.161776 | 1.429968 | -0.449881 | -0.017378 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -1.062862 | -0.585812 | -0.191312 | 2.262115 | 1.247094 | 2.101707 | 0.824933 | 0.343550 | -0.447546 | 0.702082 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.282842 | -0.585812 | 0.057922 | -0.187606 | -0.483416 | -0.293265 | -0.835134 | -0.220450 | -0.514405 | 1.150532 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.163686 | -0.585812 | -0.287171 | 1.231743 | 0.630882 | 1.502964 | -0.165296 | -1.529123 | 2.165432 | -0.029074 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -0.411456 | -0.585812 | 1.150719 | 3.407022 | 2.134300 | 3.299193 | -0.782452 | 0.279264 | -0.450595 | 1.299730 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 223 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.751
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.752 | 0.7505 | 0.7496 | Active loan Percentage feature |
| 1 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7526 | 0.7511 | 0.7502 | Credit Income Ratio feature |
| 2 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7525 | 0.7511 | 0.7501 | Credit-Annuity Ratio of Current Application Fe... |
| 3 | Logistic Regression | HCDR | 92.14% | 92.15% | 92.13% | 0.7527 | 0.7513 | 0.7504 | Income-Annuity Ratio of Current Application Fe... |
credit_card = datasets['credit_card_balance']
credit_card[['SK_ID_CURR', 'SK_ID_PREV','SK_DPD']]
| | SK_ID_CURR | SK_ID_PREV | SK_DPD |
|---|---|---|---|
| 0 | 378907 | 2562384 | 0 |
| 1 | 363914 | 2582071 | 0 |
| 2 | 371185 | 1740877 | 0 |
| 3 | 337855 | 1389973 | 0 |
| 4 | 126868 | 1891521 | 0 |
| ... | ... | ... | ... |
| 3840307 | 328243 | 1036507 | 0 |
| 3840308 | 347207 | 1714892 | 0 |
| 3840309 | 215757 | 1302323 | 0 |
| 3840310 | 430337 | 1624872 | 0 |
| 3840311 | 236760 | 2411345 | 0 |
3840312 rows × 3 columns
credit_card['SK_DPD'].value_counts()
0 3686957
1 90369
8 2772
32 2340
7 1797
...
1520 1
1712 1
195 1
1306 1
446 1
Name: SK_DPD, Length: 917, dtype: int64
credit_avg_grp = credit_card.groupby(by = ['SK_ID_CURR'])['SK_DPD'].mean().reset_index().rename(index = str, columns = {'SK_DPD':'AVG_DPD'})
credit_card = credit_card.merge(credit_avg_grp, on = ['SK_ID_CURR'], how = 'left')
credit_card
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 | 0.127660 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 | 0.010417 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 | 0.000000 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 | 0.000000 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 | 0.010417 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3840307 | 1036507 | 328243 | -9 | 0.000 | 45000 | NaN | 0.0 | NaN | NaN | 0.000 | ... | 0.000 | NaN | 0 | NaN | NaN | 0.0 | Active | 0 | 0 | 0.000000 |
| 3840308 | 1714892 | 347207 | -9 | 0.000 | 45000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 23.0 | Active | 0 | 0 | 0.000000 |
| 3840309 | 1302323 | 215757 | -9 | 275784.975 | 585000 | 270000.0 | 270000.0 | 0.0 | 0.0 | 2250.000 | ... | 273093.975 | 2.0 | 2 | 0.0 | 0.0 | 18.0 | Active | 0 | 0 | 0.000000 |
| 3840310 | 1624872 | 430337 | -10 | 0.000 | 450000 | NaN | 0.0 | NaN | NaN | 0.000 | ... | 0.000 | NaN | 0 | NaN | NaN | 0.0 | Active | 0 | 0 | 0.000000 |
| 3840311 | 2411345 | 236760 | -10 | 0.000 | 157500 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | ... | 0.000 | 0.0 | 0 | 0.0 | 0.0 | 21.0 | Completed | 0 | 0 | 0.432432 |
3840312 rows × 24 columns
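The group-then-merge step above attaches each applicant's mean `SK_DPD` to every one of their credit card rows. The same column can be produced in one step with `groupby(...).transform("mean")`. A minimal sketch on a toy frame (`cc_demo` is hypothetical) showing that the two approaches agree:

```python
import pandas as pd

# Hypothetical monthly records: applicant 1 has two rows, applicant 2 has three.
cc_demo = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "SK_DPD": [0, 4, 0, 0, 3],
})

# Approach used in the notebook: aggregate per applicant, then merge back.
avg = (cc_demo.groupby("SK_ID_CURR")["SK_DPD"].mean()
       .reset_index().rename(columns={"SK_DPD": "AVG_DPD"}))
merged = cc_demo.merge(avg, on="SK_ID_CURR", how="left")

# One-step alternative: broadcast the group mean onto every row.
cc_demo["AVG_DPD"] = cc_demo.groupby("SK_ID_CURR")["SK_DPD"].transform("mean")

same = merged["AVG_DPD"].equals(cc_demo["AVG_DPD"])
print(same)
```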
credit_card['AVG_DPD'].value_counts()
0.000000 2264271
0.010417 103968
0.020833 64224
0.031250 43648
0.010526 39615
...
28.750000 4
11.000000 4
5.750000 4
6.000000 3
9.000000 2
Name: AVG_DPD, Length: 3945, dtype: int64
credit_card['AVG_DPD'].isnull().sum()
0
credit_avg_dpd = credit_card[['SK_ID_CURR', 'AVG_DPD']]
credit_avg_dpd
| | SK_ID_CURR | AVG_DPD |
|---|---|---|
| 0 | 378907 | 0.127660 |
| 1 | 363914 | 0.010417 |
| 2 | 371185 | 0.000000 |
| 3 | 337855 | 0.000000 |
| 4 | 126868 | 0.010417 |
| ... | ... | ... |
| 3840307 | 328243 | 0.000000 |
| 3840308 | 347207 | 0.000000 |
| 3840309 | 215757 | 0.000000 |
| 3840310 | 430337 | 0.000000 |
| 3840311 | 236760 | 0.432432 |
3840312 rows × 2 columns
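`credit_avg_dpd` still has one row per monthly balance record (3840312 rows), so `SK_ID_CURR` is not unique in it and a left merge against it can multiply rows. The `validate` argument of pandas' `merge` can flag this, and a per-applicant aggregate (like `credit_avg_grp` computed earlier) avoids it. A hedged sketch with toy frames; `apps`, `per_record`, and `per_client` are hypothetical names used only here.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical per-applicant table and per-record credit card table.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
per_record = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_DPD": [0, 4, 0]})

# One row per applicant, analogous to the grouped mean computed above.
per_client = (per_record.groupby("SK_ID_CURR", as_index=False)["SK_DPD"].mean()
              .rename(columns={"SK_DPD": "AVG_DPD"}))

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    apps.merge(per_record, on="SK_ID_CURR", how="left", validate="many_to_one")
    fan_out_detected = False
except MergeError:
    fan_out_detected = True

# The aggregated table passes the check and keeps one row per applicant.
safe = apps.merge(per_client, on="SK_ID_CURR", how="left", validate="many_to_one")
print(fan_out_detected, len(safe))
```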
credit_avg_dpd = credit_avg_dpd[:1509345]
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
1509345 rows × 101 columns
import gc
gc.collect()
20
train = train.merge(credit_avg_dpd, on = 'SK_ID_CURR', how = 'left')
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | NaN |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | NaN |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | NaN |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | NaN |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8472905 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 | NaN |
| 8472906 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 | NaN |
| 8472907 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 | NaN |
| 8472908 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 | NaN |
| 8472909 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 | NaN |
8472910 rows × 102 columns
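The row count grows from 1,509,345 to 8,472,910 across this merge, which suggests that `credit_avg_dpd` holds several rows per `SK_ID_CURR`, so every matching `train` row is repeated. The following is a minimal sketch on toy frames (all values are made up for illustration) of collapsing the right-hand table to one row per client before merging. Using the mean as the aggregate is an assumption; `validate="many_to_one"` asks pandas to raise if the right-hand keys are not unique.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (values are illustrative only).
train_toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 2, 3], "TARGET": [1, 1, 0, 0]})
dpd_toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 1, 2], "AVG_DPD": [0.5, 0.5, 0.5, 0.0]})

# Naive left merge: each train row is repeated once per matching right row.
naive = train_toy.merge(dpd_toy, on="SK_ID_CURR", how="left")

# Collapse the right table to one row per client first (mean is an assumption).
dpd_unique = dpd_toy.groupby("SK_ID_CURR", as_index=False)["AVG_DPD"].mean()
safe = train_toy.merge(dpd_unique, on="SK_ID_CURR", how="left", validate="many_to_one")

print(len(train_toy), len(naive), len(safe))
```

With the right-hand table deduplicated, the merged frame keeps the same number of rows as the left-hand frame, and clients without a match still receive `NaN`.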
train['AVG_DPD'].isnull().sum()
1054730
train['AVG_DPD'].value_counts()
0.000000 4209361
0.010417 210565
0.020833 122392
0.010526 87922
0.031250 83724
...
1.625000 3
2.923077 2
39.500000 2
9.200000 2
23.200000 1
Name: AVG_DPD, Length: 3603, dtype: int64
train = train[:1700000]
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | NaN |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | NaN |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | NaN |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | NaN |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699996 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699997 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699998 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699999 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
1700000 rows × 102 columns
train['AVG_DPD'].isnull().sum()
214613
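# Rows with no match in credit_avg_dpd get AVG_DPD = 0; assign the result back instead of calling fillna in place on a column selection.
train['AVG_DPD'] = train['AVG_DPD'].fillna(0)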
train['AVG_DPD'].value_counts()
0.000000 1060899
0.010417 45943
0.020833 24935
0.031250 17953
0.010526 14949
...
1.636364 3
3.111111 3
1.818182 3
3.166667 2
1.666667 2
Name: AVG_DPD, Length: 1216, dtype: int64
train['AVG_DPD'].isnull().sum()
0
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699996 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699997 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699998 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699999 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
1700000 rows × 102 columns
avg_dpd_model = model_logreg(train, 'Average DPD Feature')
X train shape: (1088000, 101)
X validation shape: (272000, 101)
X test shape: (340000, 101)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.630563 | 3.585626 | -0.407585 | -0.095343 | 0.075151 | -0.415306 | 0.648845 | 1.061650 | -0.413921 | 1.356249 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.312255 | -0.555794 | -0.166599 | 0.406231 | 0.618199 | 0.173682 | -1.037355 | 0.789896 | -0.438971 | -0.657580 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.219138 | 0.824679 | 0.194881 | -0.552245 | 0.477013 | -0.415306 | 0.648845 | 0.646010 | -0.417775 | 1.104379 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.725651 | 0.824679 | -0.287092 | -0.357542 | 0.265558 | -0.203270 | 0.461723 | 0.972604 | -0.422241 | -0.600221 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.193248 | -0.555794 | -0.166599 | 0.120300 | 0.776871 | 0.055885 | -1.193243 | 1.575295 | -0.433909 | 0.002054 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 225 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.763
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.752 | 0.7505 | 0.7496 | Active loan Percentage feature |
| 1 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7526 | 0.7511 | 0.7502 | Credit Income Ratio feature |
| 2 | Logistic Regression | HCDR | 92.13% | 92.15% | 92.13% | 0.7525 | 0.7511 | 0.7501 | Credit-Annuity Ratio of Current Application Fe... |
| 3 | Logistic Regression | HCDR | 92.14% | 92.15% | 92.13% | 0.7527 | 0.7513 | 0.7504 | Income-Annuity Ratio of Current Application Fe... |
| 4 | Logistic Regression | HCDR | 92.10% | 92.02% | 92.08% | 0.7599 | 0.7628 | 0.7625 | Average DPD Feature |
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
1509345 rows × 54 columns
train_features = train
train_features = train_features.loc[:, ~train_features.columns.str.endswith("MODE")]
train_features = train_features.loc[:, ~train_features.columns.str.endswith("MEDI")]
train_features = train_features.loc[:, ~train_features.columns.str.endswith("AVG")]
train_features
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509340 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509341 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509342 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509343 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
| 1509344 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.454545 | 0.233333 | 14.0 | 3.0 |
1509345 rows × 54 columns
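The three suffix filters above can also be expressed as a single mask. This is a small sketch on a toy frame with hypothetical column names, using one anchored regular expression; it is an equivalent formulation and not a change to the notebook's logic.

```python
import pandas as pd

# Toy frame with a few column names in the style of the HCDR application table.
toy = pd.DataFrame(columns=["AMT_CREDIT", "APARTMENTS_AVG", "APARTMENTS_MODE",
                            "APARTMENTS_MEDI", "AVG_DPD", "TARGET"])

# True for columns whose name ends in MODE, MEDI or AVG.
suffix_mask = toy.columns.str.contains(r"(?:MODE|MEDI|AVG)$")
kept = toy.loc[:, ~suffix_mask]

print(list(kept.columns))
```

Because the pattern is anchored at the end of the name, a column such as `AVG_DPD`, where `AVG` is a prefix, is kept.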
train_features['TARGET'].value_counts()
0 1390368
1 118977
Name: TARGET, dtype: int64
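These class counts give a useful reference point for the accuracy figures reported by `model_logreg`. The sketch below only does arithmetic on the two counts shown above: a classifier that always predicts `TARGET = 0` would already be right on the majority share of rows, so an accuracy close to that share says little on its own.

```python
# Class counts copied from the value_counts() output above.
n_repaid, n_default = 1_390_368, 118_977

# Accuracy of a constant "always predict 0" classifier on these rows.
majority_share = n_repaid / (n_repaid + n_default)
imbalance_ratio = n_repaid / n_default

print(f"majority-class accuracy: {majority_share:.2%}")
print(f"non-default rows per default row: {imbalance_ratio:.1f}")
```

This is why the ROC AUC columns are the more informative ones in the results tables, and why the cells that follow rebalance the classes.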
train_zero_fe = train_features[train_features['TARGET'] == 0][:300000]
train_zero_fe
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.208736 | 36.0 | 8.0 |
| 9 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.208736 | 36.0 | 8.0 |
| 10 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.208736 | 36.0 | 8.0 |
| 11 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.208736 | 36.0 | 8.0 |
| 12 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.500000 | 20.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 326055 | 177024 | 0 | Cash loans | F | N | Y | 0 | 193500.0 | 1062027.0 | 31180.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.50 | 0.182199 | 34.0 | 6.0 |
| 326056 | 177024 | 0 | Cash loans | F | N | Y | 0 | 193500.0 | 1062027.0 | 31180.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.50 | 0.182199 | 34.0 | 6.0 |
| 326057 | 177025 | 0 | Cash loans | F | Y | Y | 0 | 270000.0 | 1304658.0 | 42214.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.25 | 0.206951 | 31.0 | 6.0 |
| 326058 | 177025 | 0 | Cash loans | F | Y | Y | 0 | 270000.0 | 1304658.0 | 42214.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.25 | 0.206951 | 31.0 | 6.0 |
| 326059 | 177025 | 0 | Cash loans | F | Y | Y | 0 | 270000.0 | 1304658.0 | 42214.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.25 | 0.206951 | 31.0 | 6.0 |
300000 rows × 54 columns
train_one_fe = train_features[train_features['TARGET'] == 1]
train_one_fe
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1509183 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.666667 | 0.757576 | 15.0 | 11.0 |
| 1509184 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.666667 | 0.757576 | 15.0 | 11.0 |
| 1509185 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.666667 | 0.757576 | 15.0 | 11.0 |
| 1509215 | 456233 | 1 | Cash loans | F | N | Y | 0 | 225000.0 | 521280.0 | 23089.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.000000 | 0.431630 | 23.0 | 10.0 |
| 1509333 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.462029 | 18.0 | 8.0 |
118977 rows × 54 columns
train_features_fe = pd.concat([train_zero_fe, train_one_fe], axis = 0, ignore_index = True)
train_features_fe
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 |
| 3 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.500000 | 20.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 418972 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.666667 | 0.757576 | 15.0 | 11.0 |
| 418973 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.666667 | 0.757576 | 15.0 | 11.0 |
| 418974 | 456225 | 1 | Cash loans | M | N | Y | 0 | 225000.0 | 297000.0 | 19975.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.666667 | 0.757576 | 15.0 | 11.0 |
| 418975 | 456233 | 1 | Cash loans | F | N | Y | 0 | 225000.0 | 521280.0 | 23089.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.000000 | 0.431630 | 23.0 | 10.0 |
| 418976 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.462029 | 18.0 | 8.0 |
418977 rows × 54 columns
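The balancing step above keeps the first 300,000 non-default rows in file order, so the retained sample depends on how the rows happen to be sorted (the last kept client is `SK_ID_CURR` 177025). A hedged alternative is to draw the majority rows at random with a fixed seed. The sketch below uses a toy frame; the sample size of 30 and the seed are arbitrary illustrative choices.

```python
import pandas as pd

# Toy imbalanced frame: 90 non-default rows, 10 default rows.
toy = pd.DataFrame({"TARGET": [0] * 90 + [1] * 10, "x": range(100)})

minority = toy[toy["TARGET"] == 1]
# Draw majority rows at random rather than taking the first n in file order.
majority = toy[toy["TARGET"] == 0].sample(n=30, random_state=42)

balanced = (
    pd.concat([majority, minority], ignore_index=True)
    .sample(frac=1, random_state=42)  # shuffle so the classes are interleaved
    .reset_index(drop=True)
)

print(balanced["TARGET"].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs, which makes the rows of the results table comparable.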
train_features_fe.columns
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'ACTIVE_LOANS_PERCENTAGE', 'CREDIT_INCOME_RATIO', 'YEARS_TO_PAY',
'INCOME_ANNUITY', 'AVG_DPD'],
dtype='object')
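# Drop the identifier and the features excluded from this run in one call.
# errors = 'ignore' skips any listed column that is not present (OVERDUE_DEBT_RATIO is not in the column list printed above).
train_features_fe.drop(columns = ['OVERDUE_DEBT_RATIO', 'SK_ID_CURR', 'AVG_DPD'], errors = 'ignore', inplace = True)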
after_balancing_model = model_logreg(train_features_fe, 'After balancing data')
X train shape: (268144, 52)
X validation shape: (67037, 52)
X test shape: (83796, 52)
| | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | ... | ORGANIZATION_TYPE_Trade: type 4 | ORGANIZATION_TYPE_Trade: type 5 | ORGANIZATION_TYPE_Trade: type 6 | ORGANIZATION_TYPE_Trade: type 7 | ORGANIZATION_TYPE_Transport: type 1 | ORGANIZATION_TYPE_Transport: type 2 | ORGANIZATION_TYPE_Transport: type 3 | ORGANIZATION_TYPE_Transport: type 4 | ORGANIZATION_TYPE_University | ORGANIZATION_TYPE_XNA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.766147 | 0.189471 | -0.177851 | -0.696794 | -0.266415 | -0.655574 | 1.296346 | -0.436598 | 0.125260 | -0.060922 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.593431 | -0.393030 | -0.641324 | -0.714714 | -0.525290 | -1.019960 | -1.610581 | 2.257097 | -1.785168 | -1.030108 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | -0.593431 | -0.238303 | -0.242989 | -0.377809 | -0.241761 | -0.744823 | -0.965262 | 2.257097 | -1.475392 | -0.352625 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | -0.593431 | -0.356624 | -1.342220 | -1.397321 | -1.289585 | -0.149341 | -1.846200 | 2.257097 | -1.968100 | -0.871735 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | -0.593431 | -0.083576 | -0.034710 | -0.050353 | -0.278743 | 0.826867 | -1.078762 | -0.438052 | -0.559583 | -0.691028 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 164 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.751
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 94.18% | 94.12% | 94.15% | 0.9709 | 0.97 | 0.9708 | After balancing data |
| 1 | Logistic Regression | HCDR | 93.61% | 93.56% | 93.55% | 0.9689 | 0.9682 | 0.9688 | After balancing data |
| 2 | Logistic Regression | HCDR | 75.18% | 75.00% | 74.95% | 0.7525 | 0.7506 | 0.752 | After balancing data |
avg_dpd_model_1 = model_logreg(train_features_fe, 'After balancing data')
X train shape: (269028, 52)
X validation shape: (67257, 52)
X test shape: (84072, 52)
| | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | ... | ORGANIZATION_TYPE_Trade: type 2 | ORGANIZATION_TYPE_Trade: type 3 | ORGANIZATION_TYPE_Trade: type 6 | ORGANIZATION_TYPE_Trade: type 7 | ORGANIZATION_TYPE_Transport: type 1 | ORGANIZATION_TYPE_Transport: type 2 | ORGANIZATION_TYPE_Transport: type 3 | ORGANIZATION_TYPE_Transport: type 4 | ORGANIZATION_TYPE_University | ORGANIZATION_TYPE_XNA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.001085 | -0.879924 | -0.120733 | -0.350369 | -0.264956 | -1.119495 | 0.689028 | -0.446984 | 0.445728 | -0.094278 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.597219 | -0.230093 | -0.213129 | -0.133947 | -0.025806 | 0.349457 | -1.752685 | 2.307547 | -1.338608 | -0.478887 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.701933 | 1.502791 | -0.105438 | -0.050035 | -0.250888 | -0.160970 | 0.727303 | -0.449342 | -0.597991 | -0.832887 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.701933 | -0.533347 | 0.571544 | 1.298998 | 0.677577 | -0.789478 | 1.249778 | -0.412017 | 0.695357 | -0.503507 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.597219 | 0.636349 | 0.598449 | 0.283711 | 0.396224 | 4.147299 | -0.873154 | -0.438274 | 1.159757 | 0.232440 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 152 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.968
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 94.18% | 94.12% | 94.15% | 0.9709 | 0.97 | 0.9708 | After balancing data |
| 1 | Logistic Regression | HCDR | 93.61% | 93.56% | 93.55% | 0.9689 | 0.9682 | 0.9688 | After balancing data |
avg_dpd_model_1 = model_logreg(train_features_fe, 'After balancing data')
X train shape: (269028, 53)
X validation shape: (67257, 53)
X test shape: (84072, 53)
| | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | ... | ORGANIZATION_TYPE_Trade: type 2 | ORGANIZATION_TYPE_Trade: type 3 | ORGANIZATION_TYPE_Trade: type 6 | ORGANIZATION_TYPE_Trade: type 7 | ORGANIZATION_TYPE_Transport: type 1 | ORGANIZATION_TYPE_Transport: type 2 | ORGANIZATION_TYPE_Transport: type 3 | ORGANIZATION_TYPE_Transport: type 4 | ORGANIZATION_TYPE_University | ORGANIZATION_TYPE_XNA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.001085 | -0.879924 | -0.120733 | -0.350369 | -0.264956 | -1.119495 | 0.689028 | -0.446984 | 0.445728 | -0.094278 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.597219 | -0.230093 | -0.213129 | -0.133947 | -0.025806 | 0.349457 | -1.752685 | 2.307547 | -1.338608 | -0.478887 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.701933 | 1.502791 | -0.105438 | -0.050035 | -0.250888 | -0.160970 | 0.727303 | -0.449342 | -0.597991 | -0.832887 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.701933 | -0.533347 | 0.571544 | 1.298998 | 0.677577 | -0.789478 | 1.249778 | -0.412017 | 0.695357 | -0.503507 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.597219 | 0.636349 | 0.598449 | 0.283711 | 0.396224 | 4.147299 | -0.873154 | -0.438274 | 1.159757 | 0.232440 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 153 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.970
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 94.18% | 94.12% | 94.15% | 0.9709 | 0.97 | 0.9708 | After balancing data |
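The `train` previews earlier in this section show many identical rows per `SK_ID_CURR`. If rows are split at random, copies of the same client can land in both the training and the test partition, which may be one reason the ROC AUC here is so much higher than in the earlier runs; this is a possible explanation, not something the outputs above confirm. A minimal sketch of a client-level split with scikit-learn's `GroupShuffleSplit` follows. It uses synthetic data and is not the notebook's `model_logreg`; in the notebook the group labels would need to be saved before `SK_ID_CURR` is dropped.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic data: 20 clients, each repeated 5 times (mimicking duplicated rows).
groups = np.repeat(np.arange(20), 5)
X = rng.normal(size=(groups.size, 3))
y = np.repeat(rng.integers(0, 2, size=20), 5)

# Split by client id so that all rows of a client stay on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No client id appears on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(len(train_idx), len(test_idx), len(shared))
```

Comparing the score from a grouped split with the score from a plain random split is a quick way to check whether duplicated clients are inflating the metric.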
bureau_data
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | DAYS_CREDIT | CREDIT_DAY_OVERDUE | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | ACTIVE_LOANS_PERCENTAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | -497 | 0 | 0.0 | 0 | 91323.00 | 0.0 | 0.0 | 0.0 | Consumer credit | -131 | 0.545455 |
| 1 | 215354 | 5714463 | Active | -208 | 0 | 0.0 | 0 | 225000.00 | 171342.0 | 0.0 | 0.0 | Credit card | -20 | 0.545455 |
| 2 | 215354 | 5714464 | Active | -203 | 0 | 0.0 | 0 | 464323.50 | 464323.5 | 0.0 | 0.0 | Consumer credit | -16 | 0.545455 |
| 3 | 215354 | 5714465 | Active | -203 | 0 | 0.0 | 0 | 90000.00 | 90000.0 | 0.0 | 0.0 | Credit card | -16 | 0.545455 |
| 4 | 215354 | 5714466 | Active | -629 | 0 | 77674.5 | 0 | 2700000.00 | 2700000.0 | 0.0 | 0.0 | Consumer credit | -21 | 0.545455 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1716423 | 259355 | 5057750 | Active | -44 | 0 | 0.0 | 0 | 11250.00 | 11250.0 | 0.0 | 0.0 | Microloan | -19 | 1.000000 |
| 1716424 | 100044 | 5057754 | Closed | -2648 | 0 | 5476.5 | 0 | 38130.84 | 0.0 | 0.0 | 0.0 | Consumer credit | -2493 | 0.545455 |
| 1716425 | 100044 | 5057762 | Closed | -1809 | 0 | 0.0 | 0 | 15570.00 | 15570.0 | 0.0 | 0.0 | Consumer credit | -967 | 0.545455 |
| 1716426 | 246829 | 5057770 | Closed | -1878 | 0 | 0.0 | 0 | 36000.00 | 0.0 | 0.0 | 0.0 | Consumer credit | -1508 | 0.258065 |
| 1716427 | 246829 | 5057778 | Closed | -463 | 0 | 0.0 | 0 | 22500.00 | 0.0 | 0.0 | 0.0 | Microloan | -387 | 0.258065 |
1716428 rows × 14 columns
msno.bar(bureau_data)
plt.show()
# Total outstanding debt per customer across all bureau records.
total_debt_grpby = (
    bureau_data.groupby('SK_ID_CURR')['AMT_CREDIT_SUM_DEBT']
    .sum()
    .reset_index()
    .rename(columns = {'AMT_CREDIT_SUM_DEBT': 'TOTAL_CUSTOMER_DEBT'})
)
# Total overdue amount per customer across all bureau records.
total_overdue_grpby = (
    bureau_data.groupby('SK_ID_CURR')['AMT_CREDIT_SUM_OVERDUE']
    .sum()
    .reset_index()
    .rename(columns = {'AMT_CREDIT_SUM_OVERDUE': 'TOTAL_CUSTOMER_OVERDUE'})
)
bureau_data = bureau_data.merge(total_debt_grpby, on = ['SK_ID_CURR'], how = 'left')
bureau_data = bureau_data.merge(total_overdue_grpby, on = ['SK_ID_CURR'], how = 'left')
del total_debt_grpby
del total_overdue_grpby
bureau_data[['SK_ID_CURR', 'TOTAL_CUSTOMER_DEBT','TOTAL_CUSTOMER_OVERDUE']]
| | SK_ID_CURR | TOTAL_CUSTOMER_DEBT | TOTAL_CUSTOMER_OVERDUE |
|---|---|---|---|
| 0 | 215354 | 4141399.680 | 0.0 |
| 1 | 215354 | 4141399.680 | 0.0 |
| 2 | 215354 | 4141399.680 | 0.0 |
| 3 | 215354 | 4141399.680 | 0.0 |
| 4 | 215354 | 4141399.680 | 0.0 |
| ... | ... | ... | ... |
| 1716423 | 259355 | 22500.000 | 0.0 |
| 1716424 | 100044 | 1523334.600 | 0.0 |
| 1716425 | 100044 | 1523334.600 | 0.0 |
| 1716426 | 246829 | 237417.525 | 0.0 |
| 1716427 | 246829 | 237417.525 | 0.0 |
1716428 rows × 3 columns
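The group-then-merge steps above can also be written with `groupby(...).transform("sum")`, which broadcasts each customer's total back onto every bureau row in one pass. The sketch below uses a small made-up frame (`bureau_toy` is a hypothetical stand-in for `bureau_data`); it is an equivalent formulation, not the code that produced the output above.

```python
import pandas as pd

# Hypothetical stand-in for bureau_data; the values are invented for illustration.
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT_SUM_DEBT": [100.0, 50.0, 0.0, 25.0, None],
    "AMT_CREDIT_SUM_OVERDUE": [0.0, 10.0, 0.0, 0.0, 5.0],
})

# transform("sum") returns one value per original row, so no merge is needed.
totals = bureau_toy.groupby("SK_ID_CURR")[
    ["AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE"]
].transform("sum")

bureau_toy["TOTAL_CUSTOMER_DEBT"] = totals["AMT_CREDIT_SUM_DEBT"]
bureau_toy["TOTAL_CUSTOMER_OVERDUE"] = totals["AMT_CREDIT_SUM_OVERDUE"]
print(bureau_toy)
```

As with `sum()` after `groupby`, missing amounts are skipped when the totals are computed.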
bureau_data['TOTAL_CUSTOMER_OVERDUE'].value_counts()
0.000 1686808
4.500 2152
9.000 755
13.500 642
22.500 498
...
402.165 1
1361214.000 1
1617403.500 1
18920.250 1
20263.500 1
Name: TOTAL_CUSTOMER_OVERDUE, Length: 1369, dtype: int64
bureau_data['TOTAL_CUSTOMER_DEBT'].value_counts()
0.00 133334
225000.00 2392
450000.00 2009
90000.00 1245
675000.00 1182
...
11614.50 1
87561.00 1
187132.50 1
1871731.62 1
175054.50 1
Name: TOTAL_CUSTOMER_DEBT, Length: 201075, dtype: int64
bureau_data['OVERDUE_DEBT_RATIO'] = bureau_data['TOTAL_CUSTOMER_OVERDUE']/bureau_data['TOTAL_CUSTOMER_DEBT']
np.isinf(bureau_data[['OVERDUE_DEBT_RATIO']]).values.sum()
418
index_to_drop = bureau_data.loc[bureau_data['OVERDUE_DEBT_RATIO'] == np.inf]
len(index_to_drop)
418
bureau_data.drop(index_to_drop.index, inplace = True)
np.isinf(bureau_data[['OVERDUE_DEBT_RATIO']]).values.sum()
0
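Dividing by a total debt of zero produces `inf` when the overdue amount is positive and `NaN` when it is also zero. The cells above drop the `inf` rows; an alternative, sketched below on made-up values, is to turn `inf` into `NaN` so that every row is kept and a downstream imputer decides how to fill the gap. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer totals covering the four cases of interest.
toy = pd.DataFrame({
    "TOTAL_CUSTOMER_OVERDUE": [0.0, 4.5, 0.0, 9.0],
    "TOTAL_CUSTOMER_DEBT": [100.0, 90.0, 0.0, 0.0],
})

ratio = toy["TOTAL_CUSTOMER_OVERDUE"] / toy["TOTAL_CUSTOMER_DEBT"]

# 9.0 / 0.0 gives inf, 0.0 / 0.0 gives NaN.
n_inf = int(np.isinf(ratio).sum())
n_nan = int(ratio.isna().sum())

# Keep all rows and mark undefined ratios as missing instead of dropping them.
toy["OVERDUE_DEBT_RATIO"] = ratio.replace([np.inf, -np.inf], np.nan)
print(n_inf, n_nan)
print(toy)
```

Whether dropping or imputing is preferable depends on how informative a customer with overdue amounts but no recorded debt is for the model.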
bureau_data['OVERDUE_DEBT_RATIO'].value_counts()
0.000000 1553892
1.000000 160
0.000030 61
0.014851 51
0.000059 47
...
0.000233 1
0.344647 1
0.021647 1
0.080689 1
0.001196 1
Name: OVERDUE_DEBT_RATIO, Length: 3630, dtype: int64
debt_ovd = bureau_data[['SK_ID_CURR','OVERDUE_DEBT_RATIO']]
debt_ovd
| | SK_ID_CURR | OVERDUE_DEBT_RATIO |
|---|---|---|
| 0 | 215354 | 0.0 |
| 1 | 215354 | 0.0 |
| 2 | 215354 | 0.0 |
| 3 | 215354 | 0.0 |
| 4 | 215354 | 0.0 |
| ... | ... | ... |
| 1716423 | 259355 | 0.0 |
| 1716424 | 100044 | 0.0 |
| 1716425 | 100044 | 0.0 |
| 1716426 | 246829 | 0.0 |
| 1716427 | 246829 | 0.0 |
1716010 rows × 2 columns
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699996 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699997 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699998 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699999 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
1700000 rows × 102 columns
debt_ovd = debt_ovd[:1500000]
train = train.merge(debt_ovd, on = 'SK_ID_CURR', how = 'left')
train = train[:1700000]
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD | OVERDUE_DEBT_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699996 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699997 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699998 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699999 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
1700000 rows × 103 columns
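Because `debt_ovd` holds one row per bureau record, a left merge on `SK_ID_CURR` repeats each application row once per matching record, which is presumably why the merged frame is cut back with `train[:1700000]`. Since `OVERDUE_DEBT_RATIO` is built from per-customer totals, it is constant within a customer, so one possible alternative is to reduce `debt_ovd` to one row per customer before merging. The frames below are hypothetical toy data, and this is a sketch of that alternative rather than the procedure used above.

```python
import pandas as pd

# Hypothetical toy frames; the real ones are `train` and `debt_ovd`.
applications = pd.DataFrame({"SK_ID_CURR": [100002, 100003], "TARGET": [1, 0]})
debt_ovd_toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100002, 100003],
    "OVERDUE_DEBT_RATIO": [0.0, 0.0, 0.0, 0.2],
})

# Merging row-level data multiplies the application rows.
row_level = applications.merge(debt_ovd_toy, on="SK_ID_CURR", how="left")

# One row per customer keeps the application count unchanged.
per_customer = debt_ovd_toy.drop_duplicates(subset="SK_ID_CURR")
customer_level = applications.merge(
    per_customer, on="SK_ID_CURR", how="left", validate="many_to_one"
)

print(len(row_level), len(customer_level))
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand frame still contains repeated keys, so an accidental row explosion fails loudly instead of silently.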
train['OVERDUE_DEBT_RATIO'].value_counts()
0.000000 1579821
0.000188 8064
0.000342 4860
0.000290 3584
0.000137 2400
...
0.000027 1
0.000286 1
91.000000 1
0.013688 1
0.000699 1
Name: OVERDUE_DEBT_RATIO, Length: 79, dtype: int64
debt_ovd_model = model_logreg(train,'Debt Overdue of each Customer Feature')
X train shape: (1088000, 102)
X validation shape: (272000, 102)
X test shape: (340000, 102)
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.634258 | 0.799427 | -0.223311 | -0.568658 | -0.872830 | -0.438033 | 0.718918 | 0.072134 | -0.431180 | -1.103801 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.311380 | -0.568021 | -1.053014 | -0.661751 | -0.951324 | -0.538037 | -0.490435 | -0.396993 | 2.348837 | -1.366889 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.228227 | 0.799427 | -0.223311 | 0.838082 | 0.035543 | 1.073130 | -0.929557 | 0.746365 | -0.407353 | 0.553147 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.667291 | 2.166875 | -0.223311 | -0.568658 | -0.600944 | -0.438033 | -0.167891 | 1.247702 | -0.421397 | 1.054306 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.208422 | -0.568021 | 1.021244 | 2.088601 | 0.864284 | 1.995384 | 3.829119 | -1.787853 | 2.348837 | -1.443060 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 221 columns
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.951
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 97.04% | 97.03% | 97.00% | 0.9521 | 0.9505 | 0.9515 | Debt Overdue of each Customer Feature |
def model_rf(train):
    X = train.drop(['TARGET'], axis = 1)
    y = train["TARGET"]
    # Split the provided training data into training, validation, and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    print(f"X train shape: {X_train.shape}")
    print(f"X validation shape: {X_valid.shape}")
    print(f"X test shape: {X_test.shape}")
    numerical_features = X_train.select_dtypes(include = ['int64', 'float64']).columns
    categorical_features = X_train.select_dtypes(include = ['object', 'bool']).columns
    #print(f"\nNumerical features : {list(numerical_features)}")
    #print(f"\nCategorical features : {list(categorical_features)}")
    # Numeric columns: standardize, then fill missing values with the median
    num_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('imputer', SimpleImputer(strategy = 'median'))
    ])
    # Categorical columns: fill missing values with the mode, then one-hot encode
    cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
    data_pipeline = ColumnTransformer([
        ("num_pipeline", num_pipeline, numerical_features),
        ("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)
    RF = RandomForestClassifier(random_state = 42, n_estimators=20, criterion='gini', max_depth=6)
    data_pipeline_rf = make_pipeline(data_pipeline, RF)
    data_pipeline_rf.fit(X_train, y_train)
    # Accuracy on each split
    train_acc = data_pipeline_rf.score(X_train, y_train)
    validAcc = data_pipeline_rf.score(X_valid, y_valid)
    testAcc = data_pipeline_rf.score(X_test, y_test)
    # ROC AUC and ROC curve on the test split
    predictions = data_pipeline_rf.predict_proba(X_test)
    print("Score", roc_auc_score(y_test, predictions[:,1]))
    fpr, tpr, _ = roc_curve(y_test, predictions[:,1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('ROC curve')
    plt.show()
    train_roc = roc_auc_score(y_train, data_pipeline_rf.predict_proba(X_train)[:, 1])
    test_roc = roc_auc_score(y_test, data_pipeline_rf.predict_proba(X_test)[:, 1])
    valid_roc = roc_auc_score(y_valid, data_pipeline_rf.predict_proba(X_valid)[:, 1])
    # Append this run to the global results table
    results.loc[len(results)] = ["Random Forest","HCDR",f"{train_acc*100:8.2f}%",
        f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",f"{np.round(train_roc,4)}",f"{np.round(test_roc,4)}",f"{np.round(valid_roc,4)}","Ensemble method"]
    display(results)
    return data_pipeline_rf, RF
debt_ovd_model_pipe, rf = model_rf(train)
X train shape: (1088000, 102)
X validation shape: (272000, 102)
X test shape: (340000, 102)
Score 0.9796172080754798
| | Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | TrainROC | TestROC | ValidROC | Feature Added |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | HCDR | 97.04% | 97.03% | 97.00% | 0.9521 | 0.9505 | 0.9515 | Debt Overdue of each Customer Feature |
| 1 | Random Forest | HCDR | 96.77% | 96.78% | 96.76% | 0.9803 | 0.9796 | 0.9786 | Ensemble method |
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD | OVERDUE_DEBT_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699996 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699997 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699998 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699999 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
1700000 rows × 103 columns
train_columns = ['ACTIVE_LOANS_PERCENTAGE','CREDIT_INCOME_RATIO','YEARS_TO_PAY','INCOME_ANNUITY','AVG_DPD','OVERDUE_DEBT_RATIO']
features = train_columns
importances = rf.feature_importances_
indices = np.argsort(importances[-6:])
importances[-6:]
array([0.00000000e+00, 2.55603290e-04, 6.82355186e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00])
plt.figure(figsize = (15, 8))
plt.title('Feature Importances')
# indices are positions within importances[-6:], so index that slice, not the full array
plt.bar(range(len(indices)), importances[-6:][indices], color='b', align='center')
plt.xticks(range(len(indices)), [features[i] for i in indices])
plt.ylabel('Relative Importance')
plt.show()
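The random forest is fitted on the output of the `ColumnTransformer`, which emits the numeric columns first and the one-hot columns last (the transformed frame shown earlier ends with `EMERGENCYSTATE_MODE_Yes`). The slice `importances[-6:]` therefore appears to cover one-hot columns rather than the six engineered features listed in `train_columns`. A safer pattern is to attach names to the importances and look features up by name. The sketch below does this on made-up data; the helper `transformed_feature_names` and the toy frame are hypothetical, and the names are built from the encoder's `categories_` attribute so the code does not depend on a particular scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def transformed_feature_names(num_cols, cat_cols, fitted_ohe):
    """Hypothetical helper: names in ColumnTransformer output order."""
    ohe_names = [
        f"{col}_{cat}"
        for col, cats in zip(cat_cols, fitted_ohe.categories_)
        for cat in cats
    ]
    return list(num_cols) + ohe_names


# Made-up data with two engineered-style numeric columns and one categorical.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "OVERDUE_DEBT_RATIO": rng.random(200),
    "AVG_DPD": rng.random(200),
    "CODE_GENDER": rng.choice(["M", "F"], size=200),
})
y_toy = (toy["AVG_DPD"] > 0.5).astype(int)

num_cols = ["OVERDUE_DEBT_RATIO", "AVG_DPD"]
cat_cols = ["CODE_GENDER"]
prep = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = make_pipeline(
    prep, RandomForestClassifier(n_estimators=20, max_depth=3, random_state=42)
)
pipe.fit(toy, y_toy)

fitted_prep = pipe.steps[0][1]
forest = pipe.steps[-1][1]
names = transformed_feature_names(
    num_cols, cat_cols, fitted_prep.named_transformers_["cat"]
)

# Index importances by name so a feature can be looked up without positions.
importances_by_name = pd.Series(forest.feature_importances_, index=names)
print(importances_by_name[num_cols].sort_values())
```

With a named series, selecting `importances_by_name[train_columns]` would return the engineered features regardless of where they sit in the transformed matrix.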
train = pd.read_pickle('train2.pkl')
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.25 | 0.498036 | 16.0 | 8.0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699996 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699997 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699998 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
| 1699999 | 172394 | 1 | Cash loans | M | Y | Y | 0 | 247500.0 | 509400.0 | 40374.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.60 | 0.485866 | 13.0 | 6.0 | 0.45 |
1700000 rows × 102 columns
train.to_pickle('train3.pkl')
X = train.drop(['TARGET'], axis = 1)
y = train["TARGET"]
# Split the provided training data into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
X train shape: (1088000, 101)
X validation shape: (272000, 101)
X test shape: (340000, 101)
numerical_features = X_train.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include = ['object', 'bool']).columns
num_pipeline = Pipeline([
('scaler', StandardScaler()),
('imputer', SimpleImputer(strategy = 'median'))
])
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_pipeline = ColumnTransformer([
("num_pipeline", num_pipeline, numerical_features),
("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)
X_train_transformed = data_pipeline.fit_transform(X_train)
column_names = list(numerical_features) + \
list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=column_names)
X_train_transformed_df
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.630563 | 3.585626 | -0.407585 | -0.095343 | 0.075151 | -0.415306 | 0.648845 | 1.061650 | -0.413921 | 1.356249 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.312255 | -0.555794 | -0.166599 | 0.406231 | 0.618199 | 0.173682 | -1.037355 | 0.789896 | -0.438971 | -0.657580 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.219138 | 0.824679 | 0.194881 | -0.552245 | 0.477013 | -0.415306 | 0.648845 | 0.646010 | -0.417775 | 1.104379 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.725651 | 0.824679 | -0.287092 | -0.357542 | 0.265558 | -0.203270 | 0.461723 | 0.972604 | -0.422241 | -0.600221 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.193248 | -0.555794 | -0.166599 | 0.120300 | 0.776871 | 0.055885 | -1.193243 | 1.575295 | -0.433909 | 0.002054 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1087995 | -0.306728 | 2.205153 | -0.166599 | -0.552245 | -0.462392 | -0.415306 | -1.057358 | -0.559915 | -0.413772 | 0.991931 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1087996 | 1.043074 | -0.555794 | 0.797347 | 0.096633 | 0.328379 | 0.091224 | -0.242193 | -0.282189 | -0.418536 | 0.675886 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1087997 | 1.186413 | -0.555794 | 0.194881 | -0.011405 | -0.190059 | 0.173682 | 0.302326 | -0.565074 | -0.423225 | 0.484215 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1087998 | -1.327695 | -0.555794 | -0.431684 | -0.882114 | -1.128169 | -0.886496 | -0.168145 | -0.599280 | -0.449468 | -0.379301 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1087999 | 1.666854 | 0.824679 | -0.648572 | 0.096763 | -0.566663 | 0.291480 | 0.177251 | 0.808900 | -0.437778 | -0.215458 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1088000 rows × 225 columns
X_test_transformed = data_pipeline.fit_transform(X_test)
column_names = list(numerical_features) + \
list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
X_test_transformed_df = pd.DataFrame(X_test_transformed, columns=column_names)
X_test_transformed_df
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.447151 | -0.556897 | 0.869128 | 1.011823 | 1.281185 | 0.876913 | 0.614860 | -0.487535 | -0.416764 | -0.598926 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | -0.391754 | -0.556897 | -0.030261 | -0.676659 | 0.929630 | -0.650743 | 0.742054 | 0.956287 | -0.418171 | -0.056029 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.329466 | -0.556897 | 0.434424 | 1.949191 | 1.178944 | 1.969774 | 0.221096 | -0.607175 | -0.425973 | 0.166466 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.330138 | -0.556897 | 0.239556 | -0.012484 | 0.404555 | 0.171841 | 0.467524 | 0.870830 | -0.446416 | 0.529722 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.535746 | -0.556897 | -0.255108 | -1.264610 | -0.998764 | -1.238303 | 1.704035 | 0.284837 | -0.412453 | -1.165379 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339995 | 0.036900 | -0.556897 | -0.180159 | 0.801808 | 0.337469 | 0.782903 | -0.205911 | -1.391755 | 2.309188 | 1.234381 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 339996 | 0.655356 | -0.556897 | 0.569332 | 0.865105 | 0.106539 | 0.782903 | 0.182078 | -0.753673 | -0.448449 | -1.200285 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 339997 | 1.709535 | -0.556897 | -0.434986 | -0.558115 | -0.428212 | -0.615489 | -0.125271 | -1.769395 | 2.309188 | 0.210170 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 339998 | 1.169414 | -0.556897 | -0.434986 | -0.210165 | 0.238130 | -0.415719 | -1.400869 | -0.042072 | -0.456705 | -0.627022 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 339999 | 1.079105 | 0.824931 | 0.569332 | 1.606087 | 0.754175 | 1.934520 | 0.969608 | 0.295146 | -0.437885 | 1.351872 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
340000 rows × 222 columns
set(X_train_transformed_df.columns).difference(set(X_test_transformed_df.columns))
{'NAME_FAMILY_STATUS_Unknown',
'NAME_INCOME_TYPE_Businessman',
'NAME_INCOME_TYPE_Maternity leave'}
X_test_transformed_df['NAME_FAMILY_STATUS_Unknown'] = 0
X_test_transformed_df['NAME_INCOME_TYPE_Businessman'] = 0
X_test_transformed_df['NAME_INCOME_TYPE_Maternity leave'] = 0
X_train_transformed_df.shape
(1088000, 225)
X_test_transformed_df.shape
(340000, 225)
X_test_transformed = X_test_transformed_df.to_numpy()
X_test_transformed.shape
(340000, 225)
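The three columns missing from the test frame arise because `fit_transform` is called on `X_test`, which re-learns the scaler, imputers and one-hot categories from the test split alone. Appending the missing columns at the end also places them in a different position than in the training matrix. The usual pattern is to fit the preprocessing once on the training split and call `transform` on every other split, which yields the same columns in the same order. The sketch below shows this on made-up frames (`X_train_toy`, `X_test_toy` and `prep` are hypothetical names); it is not the code that produced the numbers in this notebook.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up splits: "Businessman" occurs only in the training split.
X_train_toy = pd.DataFrame({
    "AMT_CREDIT": [1.0, 2.0, 3.0, 4.0],
    "NAME_INCOME_TYPE": ["Working", "Businessman", "Working", "Pensioner"],
})
X_test_toy = pd.DataFrame({
    "AMT_CREDIT": [2.5, 3.5],
    "NAME_INCOME_TYPE": ["Working", "Pensioner"],
})

prep = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["AMT_CREDIT"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["NAME_INCOME_TYPE"]),
])

# Learn statistics and categories from the training split only.
train_matrix = prep.fit_transform(X_train_toy)
# Reuse them on the test split: same columns, same order.
test_matrix = prep.transform(X_test_toy)

# For contrast, re-fitting a fresh copy on the test split loses a column.
refit_width = clone(prep).fit_transform(X_test_toy).shape[1]

print(train_matrix.shape, test_matrix.shape, refit_width)
```

With `handle_unknown="ignore"`, a category that appears only in the test split is encoded as all zeros instead of raising an error, so no manual column patching is needed.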
param_grid_lr = {
'C': [10**x for x in range(-3,4)],
'penalty': ['l1','l2']
}
grid_search_lr = GridSearchCV(LogisticRegression(), param_grid = param_grid_lr, scoring='accuracy', cv=2, refit = True)
grid_search_model1 = grid_search_lr.fit(X_train_transformed, y_train)
grid_search_lr.best_params_
{'C': 100, 'penalty': 'l2'}
grid_search_lr.best_estimator_.fit(X_train_transformed, y_train)
LogisticRegression(C=100)
best_train_accuracy = grid_search_lr.best_estimator_.score(X_train_transformed, y_train)
best_train_accuracy*100
97.09981617647058
best_test_accuracy = grid_search_lr.best_estimator_.score(X_test_transformed, y_test)
best_test_accuracy*100
92.4614705882353
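After hyperparameter tuning, the best logistic regression model (C=100, L2 penalty) reaches a training accuracy of 97.10% and a test accuracy of 92.46%. This is below the test accuracies reported in the earlier results tables, so the tuning run does not improve on them here.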
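One detail of the grid above: with its default `lbfgs` solver, `LogisticRegression` supports only the L2 penalty (or none), so the `'l1'` candidates cannot be fitted and, depending on the scikit-learn version, are scored as NaN with a warning. A solver such as `liblinear` handles both penalties. The sketch below illustrates this on synthetic data from `make_classification`; the data and the reduced grid are assumptions for illustration. Note also that with `refit=True` the search already refits the best estimator on the full training data, so a separate `best_estimator_.fit(...)` call is not required.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the transformed training matrix.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {
    "C": [0.01, 1, 100],
    "penalty": ["l1", "l2"],
}

# liblinear supports both the L1 and the L2 penalty.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid=param_grid,
    scoring="accuracy",
    cv=2,
    refit=True,
)
search.fit(X_toy, y_toy)

# Every candidate should now have a real score instead of NaN.
all_scored = not np.isnan(search.cv_results_["mean_test_score"]).any()
print(search.best_params_, all_scored)
```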
param_grid = {
'n_estimators': [80, 100,120],
'max_depth' : [3, 4],
'criterion' :['gini','entropy']
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid = param_grid, scoring='accuracy', cv=2, refit = True)
grid_search.fit(X_train_transformed, y_train)
GridSearchCV(cv=2, estimator=RandomForestClassifier(),
param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [3, 4],
'n_estimators': [80, 100, 120]},
scoring='accuracy')
grid_search.best_params_
{'criterion': 'gini', 'max_depth': 4, 'n_estimators': 80}
grid_search.best_estimator_.fit(X_train_transformed, y_train)
RandomForestClassifier(max_depth=4, n_estimators=80)
best_train_accuracy = grid_search.best_estimator_.score(X_train_transformed, y_train)
best_train_accuracy*100
94.196875
best_test_accuracy = grid_search.best_estimator_.score(X_test_transformed, y_test)
best_test_accuracy*100
92.975
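After hyperparameter tuning, the best random forest classifier (gini criterion, max_depth=4, n_estimators=80) reaches a training accuracy of 94.20% and a test accuracy of 92.98%. This is below the 96.76% test accuracy of the earlier random forest run, so the tuning run does not improve on it here.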
import torch
import torch.nn as nn
import torch.optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from tensorflow.keras.callbacks import TensorBoard
import torchvision
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train = datasets['application_train']
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 117 columns
train.drop(['DAYS_LAST_PHONE_CHANGE','OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'], axis = 1, inplace = True)
train = train.loc[:, ~train.columns.str.startswith("FLAG_DOCUMENT_")]
train = train.loc[:, ~train.columns.str.endswith("MODE")]
train = train.loc[:, ~train.columns.str.endswith("MEDI")]
train = train.loc[:, ~train.columns.str.endswith("AVG")]
import missingno as msno
msno.bar(train)
plt.show()
train
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | School | 0.311267 | 0.622246 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | Government | NaN | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | Religion | NaN | 0.322738 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | Services | 0.145570 | 0.681632 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | XNA | NaN | 0.115992 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | School | 0.744026 | 0.535722 | 0.218859 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | Business Entity Type 1 | NaN | 0.514163 | 0.661024 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | Business Entity Type 3 | 0.734460 | 0.708569 | 0.113922 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 50 columns
numerical_features = train.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = train.select_dtypes(include = ['object', 'bool']).columns
print(f"\nNumerical features : {list(numerical_features)}")
print(f"\nCategorical features : {list(categorical_features)}")
Numerical features : ['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']

Categorical features : ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE']
# Restrict to the numeric columns explicitly: recent pandas versions raise an error when corr() meets object columns.
correlations = train[numerical_features].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations:
OWN_CAR_AGE                    0.037612
DAYS_REGISTRATION              0.041975
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
AMT_CREDIT                   -0.030369
HOUR_APPR_PROCESS_START      -0.024166
FLAG_PHONE                   -0.023806
AMT_ANNUITY                  -0.012817
Name: TARGET, dtype: float64
correlations = pd.DataFrame(correlations, columns = ['TARGET'])
correlations
| | TARGET |
|---|---|
| EXT_SOURCE_3 | -0.178919 |
| EXT_SOURCE_2 | -0.160472 |
| EXT_SOURCE_1 | -0.155317 |
| DAYS_EMPLOYED | -0.044932 |
| AMT_GOODS_PRICE | -0.039645 |
| REGION_POPULATION_RELATIVE | -0.037227 |
| AMT_CREDIT | -0.030369 |
| HOUR_APPR_PROCESS_START | -0.024166 |
| FLAG_PHONE | -0.023806 |
| AMT_ANNUITY | -0.012817 |
| AMT_REQ_CREDIT_BUREAU_MON | -0.012462 |
| AMT_INCOME_TOTAL | -0.003982 |
| SK_ID_CURR | -0.002108 |
| AMT_REQ_CREDIT_BUREAU_QRT | -0.002022 |
| FLAG_EMAIL | -0.001758 |
| FLAG_CONT_MOBILE | 0.000370 |
| FLAG_MOBIL | 0.000534 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 0.000788 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 0.000930 |
| AMT_REQ_CREDIT_BUREAU_DAY | 0.002704 |
| LIVE_REGION_NOT_WORK_REGION | 0.002819 |
| REG_REGION_NOT_LIVE_REGION | 0.005576 |
| REG_REGION_NOT_WORK_REGION | 0.006942 |
| CNT_FAM_MEMBERS | 0.009308 |
| CNT_CHILDREN | 0.019187 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 0.019930 |
| FLAG_WORK_PHONE | 0.028524 |
| LIVE_CITY_NOT_WORK_CITY | 0.032518 |
| OWN_CAR_AGE | 0.037612 |
| DAYS_REGISTRATION | 0.041975 |
| REG_CITY_NOT_LIVE_CITY | 0.044395 |
| FLAG_EMP_PHONE | 0.045982 |
| REG_CITY_NOT_WORK_CITY | 0.050994 |
| DAYS_ID_PUBLISH | 0.051457 |
| REGION_RATING_CLIENT | 0.058899 |
| REGION_RATING_CLIENT_W_CITY | 0.060893 |
| DAYS_BIRTH | 0.078239 |
| TARGET | 1.000000 |
correlations["abs_Target"] = np.abs(correlations["TARGET"])
correlations.sort_values("abs_Target", ascending = False, inplace = True)
correlations
| | TARGET | abs_Target |
|---|---|---|
| TARGET | 1.000000 | 1.000000 |
| EXT_SOURCE_3 | -0.178919 | 0.178919 |
| EXT_SOURCE_2 | -0.160472 | 0.160472 |
| EXT_SOURCE_1 | -0.155317 | 0.155317 |
| DAYS_BIRTH | 0.078239 | 0.078239 |
| REGION_RATING_CLIENT_W_CITY | 0.060893 | 0.060893 |
| REGION_RATING_CLIENT | 0.058899 | 0.058899 |
| DAYS_ID_PUBLISH | 0.051457 | 0.051457 |
| REG_CITY_NOT_WORK_CITY | 0.050994 | 0.050994 |
| FLAG_EMP_PHONE | 0.045982 | 0.045982 |
| DAYS_EMPLOYED | -0.044932 | 0.044932 |
| REG_CITY_NOT_LIVE_CITY | 0.044395 | 0.044395 |
| DAYS_REGISTRATION | 0.041975 | 0.041975 |
| AMT_GOODS_PRICE | -0.039645 | 0.039645 |
| OWN_CAR_AGE | 0.037612 | 0.037612 |
| REGION_POPULATION_RELATIVE | -0.037227 | 0.037227 |
| LIVE_CITY_NOT_WORK_CITY | 0.032518 | 0.032518 |
| AMT_CREDIT | -0.030369 | 0.030369 |
| FLAG_WORK_PHONE | 0.028524 | 0.028524 |
| HOUR_APPR_PROCESS_START | -0.024166 | 0.024166 |
| FLAG_PHONE | -0.023806 | 0.023806 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 0.019930 | 0.019930 |
| CNT_CHILDREN | 0.019187 | 0.019187 |
| AMT_ANNUITY | -0.012817 | 0.012817 |
| AMT_REQ_CREDIT_BUREAU_MON | -0.012462 | 0.012462 |
| CNT_FAM_MEMBERS | 0.009308 | 0.009308 |
| REG_REGION_NOT_WORK_REGION | 0.006942 | 0.006942 |
| REG_REGION_NOT_LIVE_REGION | 0.005576 | 0.005576 |
| AMT_INCOME_TOTAL | -0.003982 | 0.003982 |
| LIVE_REGION_NOT_WORK_REGION | 0.002819 | 0.002819 |
| AMT_REQ_CREDIT_BUREAU_DAY | 0.002704 | 0.002704 |
| SK_ID_CURR | -0.002108 | 0.002108 |
| AMT_REQ_CREDIT_BUREAU_QRT | -0.002022 | 0.002022 |
| FLAG_EMAIL | -0.001758 | 0.001758 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 0.000930 | 0.000930 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 0.000788 | 0.000788 |
| FLAG_MOBIL | 0.000534 | 0.000534 |
| FLAG_CONT_MOBILE | 0.000370 | 0.000370 |
# Take an explicit copy so that the new columns added later do not trigger SettingWithCopyWarning.
train_f = train[['TARGET','EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','DAYS_BIRTH','REGION_RATING_CLIENT_W_CITY','REGION_RATING_CLIENT','DAYS_ID_PUBLISH','REG_CITY_NOT_WORK_CITY','FLAG_EMP_PHONE','DAYS_EMPLOYED']].copy()
train_f
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.139376 | 0.262949 | 0.083037 | -9461 | 2 | 2 | -2120 | 0 | 1 | -637 |
| 1 | 0 | NaN | 0.622246 | 0.311267 | -16765 | 1 | 1 | -291 | 0 | 1 | -1188 |
| 2 | 0 | 0.729567 | 0.555912 | NaN | -19046 | 2 | 2 | -2531 | 0 | 1 | -225 |
| 3 | 0 | NaN | 0.650442 | NaN | -19005 | 2 | 2 | -2437 | 0 | 1 | -3039 |
| 4 | 0 | NaN | 0.322738 | NaN | -19932 | 2 | 2 | -3458 | 1 | 1 | -3038 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 0 | NaN | 0.681632 | 0.145570 | -9327 | 1 | 1 | -1982 | 0 | 1 | -236 |
| 307507 | 0 | NaN | 0.115992 | NaN | -20775 | 2 | 2 | -4090 | 0 | 0 | 365243 |
| 307508 | 0 | 0.218859 | 0.535722 | 0.744026 | -14966 | 3 | 3 | -5150 | 1 | 1 | -7921 |
| 307509 | 1 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 |
| 307510 | 0 | 0.113922 | 0.708569 | 0.734460 | -16856 | 1 | 1 | -410 | 1 | 1 | -1262 |
307511 rows × 11 columns
train_f['TARGET'].value_counts()
0    282686
1     24825
Name: TARGET, dtype: int64
train_f['TARGET'].describe()
count    307511.000000
mean          0.080729
std           0.272419
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: TARGET, dtype: float64
train_features = pd.read_pickle('train3.pkl')
train_features
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD | OVERDUE_DEBT_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699996 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699997 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699998 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699999 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
1700000 rows × 103 columns
train_features['TARGET'].value_counts()
0    1579643
1     120357
Name: TARGET, dtype: int64
train_zero = train_features[train_features['TARGET'] == 0]
train_zero
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD | OVERDUE_DEBT_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 65 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 66 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 67 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 68 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1699995 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699996 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699997 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699998 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
| 1699999 | 108558 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 1350000.0 | 71928.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.116667 | 19.0 | 2.0 | 0.049383 | 0.0 |
1579643 rows × 103 columns
train_zero = train_zero[:120357]
train_zero
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD | OVERDUE_DEBT_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 65 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 66 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 67 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| 68 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.208736 | 36.0 | 8.0 | 0.000000 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 126015 | 100504 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 187704.0 | 12672.0 | ... | 0.0 | 5.0 | 0.0 | 1.0 | 0.818182 | 0.839087 | 15.0 | 12.0 | 0.150538 | 0.0 |
| 126016 | 100504 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 187704.0 | 12672.0 | ... | 0.0 | 5.0 | 0.0 | 1.0 | 0.818182 | 0.839087 | 15.0 | 12.0 | 0.150538 | 0.0 |
| 126017 | 100504 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 187704.0 | 12672.0 | ... | 0.0 | 5.0 | 0.0 | 1.0 | 0.818182 | 0.839087 | 15.0 | 12.0 | 0.150538 | 0.0 |
| 126018 | 100504 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 187704.0 | 12672.0 | ... | 0.0 | 5.0 | 0.0 | 1.0 | 0.818182 | 0.839087 | 15.0 | 12.0 | 0.150538 | 0.0 |
| 126019 | 100504 | 0 | Cash loans | F | N | Y | 0 | 157500.0 | 187704.0 | 12672.0 | ... | 0.0 | 5.0 | 0.0 | 1.0 | 0.818182 | 0.839087 | 15.0 | 12.0 | 0.150538 | 0.0 |
120357 rows × 103 columns
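# NOTE: these assignments align on the row index, not on SK_ID_CURR. train_features holds
# several rows per applicant, so a value copied here is not guaranteed to belong to the
# applicant stored in the same row of train_f.
train_f['ACTIVE_LOANS_PERCENTAGE'] = train_features['ACTIVE_LOANS_PERCENTAGE'][:307511]
train_f['CREDIT_INCOME_RATIO'] = train_features['CREDIT_INCOME_RATIO'][:307511]
train_f['YEARS_TO_PAY'] = train_features['YEARS_TO_PAY'][:307511]
train_f['INCOME_ANNUITY'] = train_features['INCOME_ANNUITY'][:307511]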
train_f
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.139376 | 0.262949 | 0.083037 | -9461 | 2 | 2 | -2120 | 0 | 1 | -637 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 1 | 0 | NaN | 0.622246 | 0.311267 | -16765 | 1 | 1 | -291 | 0 | 1 | -1188 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 0 | 0.729567 | 0.555912 | NaN | -19046 | 2 | 2 | -2531 | 0 | 1 | -225 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 0 | NaN | 0.650442 | NaN | -19005 | 2 | 2 | -2437 | 0 | 1 | -3039 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 0 | NaN | 0.322738 | NaN | -19932 | 2 | 2 | -3458 | 1 | 1 | -3038 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 0 | NaN | 0.681632 | 0.145570 | -9327 | 1 | 1 | -1982 | 0 | 1 | -236 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307507 | 0 | NaN | 0.115992 | NaN | -20775 | 2 | 2 | -4090 | 0 | 0 | 365243 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307508 | 0 | 0.218859 | 0.535722 | 0.744026 | -14966 | 3 | 3 | -5150 | 1 | 1 | -7921 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307509 | 1 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307510 | 0 | 0.113922 | 0.708569 | 0.734460 | -16856 | 1 | 1 | -410 | 1 | 1 | -1262 | 0.214286 | 1.472320 | 9.0 | 14.0 |
307511 rows × 15 columns
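The engineered columns above are attached by row position, yet the first rows of `train_features` all carry `SK_ID_CURR` 100002, which is why rows 0 to 4 of `train_f` share identical values for the four new columns. A minimal sketch of a key-based alternative follows, using small made-up frames; the names `apps`, `feats`, `feats_unique` and `aligned`, and all values in them, are hypothetical and only illustrate the join.

```python
import pandas as pd

# Hypothetical stand-in for the application table: one row per applicant.
apps = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004],
                     "TARGET": [1, 0, 0]})

# Hypothetical stand-in for the engineered table: repeated rows per applicant.
feats = pd.DataFrame({"SK_ID_CURR": [100002, 100002, 100003, 100003],
                      "CREDIT_INCOME_RATIO": [0.5, 0.5, 0.2, 0.2]})

# Collapse to one row per applicant, then join on the key instead of on row position.
feats_unique = feats.drop_duplicates(subset="SK_ID_CURR")
aligned = apps.merge(feats_unique, on="SK_ID_CURR", how="left")
print(aligned)
```

Applicants without engineered features (100004 here) end up with `NaN`, which the imputer can handle later. In the notebook the key would have to be carried along from `train`, since `train_f` does not keep `SK_ID_CURR`.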
train_f['TARGET'].value_counts()
0    282686
1     24825
Name: TARGET, dtype: int64
train_zero = train_f[train_f['TARGET'] == 0]
train_zero
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | NaN | 0.622246 | 0.311267 | -16765 | 1 | 1 | -291 | 0 | 1 | -1188 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 2 | 0 | 0.729567 | 0.555912 | NaN | -19046 | 2 | 2 | -2531 | 0 | 1 | -225 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 3 | 0 | NaN | 0.650442 | NaN | -19005 | 2 | 2 | -2437 | 0 | 1 | -3039 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 4 | 0 | NaN | 0.322738 | NaN | -19932 | 2 | 2 | -3458 | 1 | 1 | -3038 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 5 | 0 | 0.621226 | 0.354225 | NaN | -16941 | 2 | 2 | -477 | 0 | 1 | -1588 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307505 | 0 | 0.742182 | 0.346391 | NaN | -24384 | 2 | 2 | -2357 | 0 | 0 | 365243 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307506 | 0 | NaN | 0.681632 | 0.145570 | -9327 | 1 | 1 | -1982 | 0 | 1 | -236 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307507 | 0 | NaN | 0.115992 | NaN | -20775 | 2 | 2 | -4090 | 0 | 0 | 365243 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307508 | 0 | 0.218859 | 0.535722 | 0.744026 | -14966 | 3 | 3 | -5150 | 1 | 1 | -7921 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307510 | 0 | 0.113922 | 0.708569 | 0.734460 | -16856 | 1 | 1 | -410 | 1 | 1 | -1262 | 0.214286 | 1.472320 | 9.0 | 14.0 |
282686 rows × 15 columns
train_zero_f = train_zero.sample(n = 125000, random_state=1)
train_zero_f
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 229005 | 0 | 0.713631 | 0.510759 | 0.858157 | -22456 | 2 | 3 | -3841 | 0 | 0 | 365243 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 14166 | 0 | 0.396220 | 0.652577 | 0.764642 | -19506 | 2 | 2 | -2423 | 0 | 0 | 365243 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 67925 | 0 | 0.689479 | 0.644868 | NaN | -10245 | 1 | 1 | -1881 | 0 | 1 | -2418 | 0.473684 | 0.179083 | 34.0 | 6.0 |
| 297809 | 0 | NaN | 0.218719 | 0.294392 | -9349 | 3 | 3 | -22 | 0 | 1 | -1282 | 0.500000 | 0.974062 | 10.0 | 10.0 |
| 272986 | 0 | 0.318596 | 0.433441 | NaN | -13464 | 3 | 3 | -3765 | 0 | 1 | -1405 | 0.181818 | 0.272727 | 20.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162751 | 0 | 0.546023 | 0.455792 | 0.650485 | -12033 | 2 | 2 | -197 | 1 | 1 | -563 | 0.272727 | 0.212277 | 29.0 | 6.0 |
| 283535 | 0 | 0.767523 | 0.640635 | NaN | -16100 | 2 | 2 | -4256 | 0 | 1 | -6913 | 0.142857 | 0.944971 | 13.0 | 12.0 |
| 221483 | 0 | 0.749022 | 0.626840 | 0.794135 | -18112 | 2 | 2 | -1668 | 0 | 1 | -5793 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 11620 | 0 | 0.427657 | 0.229003 | NaN | -14814 | 2 | 2 | -5021 | 0 | 1 | -3227 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 225177 | 0 | NaN | 0.751044 | NaN | -20220 | 2 | 2 | -3777 | 0 | 1 | -3093 | 0.181818 | 0.495050 | 10.0 | 5.0 |
125000 rows × 15 columns
train_one = train_f[train_f['TARGET'] == 1]
train_one
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.139376 | 0.262949 | 0.083037 | -9461 | 2 | 2 | -2120 | 0 | 1 | -637 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 26 | 1 | 0.190706 | 0.548477 | NaN | -18724 | 2 | 3 | -1827 | 0 | 1 | -2628 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 40 | 1 | 0.320163 | 0.306841 | NaN | -17482 | 2 | 2 | -1029 | 0 | 1 | -1262 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 42 | 1 | 0.399676 | 0.674203 | 0.468208 | -13384 | 3 | 3 | -4409 | 0 | 1 | -3597 | 0.250000 | 0.498036 | 16.0 | 8.0 |
| 81 | 1 | 0.720944 | 0.023952 | NaN | -24794 | 2 | 2 | -4199 | 0 | 0 | 365243 | 0.000000 | 0.500000 | 20.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307448 | 1 | 0.360613 | 0.329708 | 0.073452 | -9918 | 3 | 3 | -2580 | 0 | 1 | -3048 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307475 | 1 | 0.424130 | 0.583214 | 0.634729 | -13416 | 2 | 2 | -4704 | 0 | 1 | -2405 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307481 | 1 | 0.511892 | 0.713524 | NaN | -20644 | 2 | 2 | -3832 | 0 | 1 | -3147 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307489 | 1 | 0.397946 | 0.615261 | NaN | -16471 | 2 | 2 | -9 | 0 | 1 | -286 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 307509 | 1 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 | 0.214286 | 1.472320 | 9.0 | 14.0 |
24825 rows × 15 columns
train_f1 = pd.concat([train_zero_f, train_one], axis = 0, ignore_index = True)
train_f1
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.713631 | 0.510759 | 0.858157 | -22456 | 2 | 3 | -3841 | 0 | 0 | 365243 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 1 | 0 | 0.396220 | 0.652577 | 0.764642 | -19506 | 2 | 2 | -2423 | 0 | 0 | 365243 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 2 | 0 | 0.689479 | 0.644868 | NaN | -10245 | 1 | 1 | -1881 | 0 | 1 | -2418 | 0.473684 | 0.179083 | 34.0 | 6.0 |
| 3 | 0 | NaN | 0.218719 | 0.294392 | -9349 | 3 | 3 | -22 | 0 | 1 | -1282 | 0.500000 | 0.974062 | 10.0 | 10.0 |
| 4 | 0 | 0.318596 | 0.433441 | NaN | -13464 | 3 | 3 | -3765 | 0 | 1 | -1405 | 0.181818 | 0.272727 | 20.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149820 | 1 | 0.360613 | 0.329708 | 0.073452 | -9918 | 3 | 3 | -2580 | 0 | 1 | -3048 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149821 | 1 | 0.424130 | 0.583214 | 0.634729 | -13416 | 2 | 2 | -4704 | 0 | 1 | -2405 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149822 | 1 | 0.511892 | 0.713524 | NaN | -20644 | 2 | 2 | -3832 | 0 | 1 | -3147 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149823 | 1 | 0.397946 | 0.615261 | NaN | -16471 | 2 | 2 | -9 | 0 | 1 | -286 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149824 | 1 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 | 0.214286 | 1.472320 | 9.0 | 14.0 |
149825 rows × 15 columns
train_f1['TARGET'].value_counts()
0    125000
1     24825
Name: TARGET, dtype: int64
num_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy = 'mean'))
])
train_f_nt = train_f1.drop('TARGET', axis = 1)
train_f_transformed = num_pipeline.fit_transform(train_f_nt)
train_f_nt
| | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.713631 | 0.510759 | 0.858157 | -22456 | 2 | 3 | -3841 | 0 | 0 | 365243 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 1 | 0.396220 | 0.652577 | 0.764642 | -19506 | 2 | 2 | -2423 | 0 | 0 | 365243 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 2 | 0.689479 | 0.644868 | NaN | -10245 | 1 | 1 | -1881 | 0 | 1 | -2418 | 0.473684 | 0.179083 | 34.0 | 6.0 |
| 3 | NaN | 0.218719 | 0.294392 | -9349 | 3 | 3 | -22 | 0 | 1 | -1282 | 0.500000 | 0.974062 | 10.0 | 10.0 |
| 4 | 0.318596 | 0.433441 | NaN | -13464 | 3 | 3 | -3765 | 0 | 1 | -1405 | 0.181818 | 0.272727 | 20.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149820 | 0.360613 | 0.329708 | 0.073452 | -9918 | 3 | 3 | -2580 | 0 | 1 | -3048 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149821 | 0.424130 | 0.583214 | 0.634729 | -13416 | 2 | 2 | -4704 | 0 | 1 | -2405 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149822 | 0.511892 | 0.713524 | NaN | -20644 | 2 | 2 | -3832 | 0 | 1 | -3147 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149823 | 0.397946 | 0.615261 | NaN | -16471 | 2 | 2 | -9 | 0 | 1 | -286 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149824 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 | 0.214286 | 1.472320 | 9.0 | 14.0 |
149825 rows × 14 columns
train_f_transformed
array([[ 1.07416901e+00, 2.91790754e-02, 1.71950528e+00, ...,
3.50250579e-01, -1.53231874e+00, -5.89779917e-01],
[-5.22701628e-01, 7.53242843e-01, 1.28004684e+00, ...,
2.18019534e-02, -1.66569121e-01, 3.71126751e-01],
[ 9.52661156e-01, 7.13884723e-01, 4.67519818e-16, ...,
-7.41588177e-01, 1.44749861e+00, -3.49553250e-01],
...,
[ 5.92350574e-02, 1.06441836e+00, 4.67519818e-16, ...,
3.72726211e+00, -1.65647779e+00, 1.57226009e+00],
[-5.14014503e-01, 5.62725792e-01, 4.67519818e-16, ...,
3.72726211e+00, -1.65647779e+00, 1.57226009e+00],
[ 8.09503549e-01, 4.65572525e-02, 4.67519818e-16, ...,
3.72726211e+00, -1.65647779e+00, 1.57226009e+00]])
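The entries such as `4.67519818e-16` in the array above sit where the input had `NaN`. `StandardScaler` ignores missing values when fitting and passes them through on transform, and the mean imputer then fills them with the mean of the already standardized column, which is zero up to floating-point error. A small NumPy sketch of the same two steps on a made-up column (the array `col` is hypothetical, and this mimics the pipeline rather than calling scikit-learn):

```python
import numpy as np

# Hypothetical column with one missing value.
col = np.array([1.0, 2.0, np.nan, 3.0])

# Step 1: standardize using statistics computed over the non-missing entries only.
scaled = (col - np.nanmean(col)) / np.nanstd(col)

# Step 2: mean-impute the standardized column; its mean is (numerically) zero.
fill_value = np.nanmean(scaled)
imputed = np.where(np.isnan(scaled), fill_value, scaled)
print(imputed)
```

Imputed cells therefore land at the centre of each feature's distribution. Swapping the two steps (impute first, then scale) would also put them at zero but would shrink the estimated standard deviation a little, because the filled-in means add no spread.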
column_names = list(train_f_nt.columns)
train_f_transformed_df = pd.DataFrame(train_f_transformed, columns=column_names)
train_f_transformed_df['TARGET'] = train_f1['TARGET']
train_f_transformed_df
| | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.074169e+00 | 0.029179 | 1.719505e+00 | -1.497423 | -0.081634 | 1.843341 | -0.575876 | -0.557098 | -2.176948 | 2.177106 | -0.883071 | 0.350251 | -1.532319 | -0.589780 | 0 |
| 1 | -5.227016e-01 | 0.753243 | 1.280047e+00 | -0.820308 | -0.081634 | -0.120414 | 0.362029 | -0.557098 | -2.176948 | 2.177106 | -0.275281 | 0.021802 | -0.166569 | 0.371127 | 0 |
| 2 | 9.526612e-01 | 0.713885 | 4.675198e-16 | 1.305372 | -2.069750 | -2.084169 | 0.720523 | -0.557098 | 0.459359 | -0.459874 | 0.696573 | -0.741588 | 1.447499 | -0.349553 | 0 |
| 3 | 2.924445e-16 | -1.461859 | -9.298265e-01 | 1.511031 | 1.906481 | 1.843341 | 1.950117 | -0.557098 | 0.459359 | -0.451726 | 0.839000 | 2.005504 | -1.532319 | 0.611353 | 0 |
| 4 | -9.132213e-01 | -0.365578 | 4.675198e-16 | 0.566514 | 1.906481 | 1.843341 | -0.525607 | -0.557098 | 0.459359 | -0.452608 | -0.883071 | -0.417996 | -0.290728 | -0.589780 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149820 | -7.018373e-01 | -0.895198 | -1.968105e+00 | 1.380428 | 1.906481 | 1.843341 | 0.258185 | -0.557098 | 0.459359 | -0.464392 | -0.707349 | 3.727262 | -1.656478 | 1.572260 | 1 |
| 149821 | -3.822852e-01 | 0.399105 | 6.695401e-01 | 0.577531 | -0.081634 | -0.120414 | -1.146688 | -0.557098 | 0.459359 | -0.459781 | -0.707349 | 3.727262 | -1.656478 | 1.572260 | 1 |
| 149822 | 5.923506e-02 | 1.064418 | 4.675198e-16 | -1.081514 | -0.081634 | -0.120414 | -0.569923 | -0.557098 | 0.459359 | -0.465102 | -0.707349 | 3.727262 | -1.656478 | 1.572260 | 1 |
| 149823 | -5.140145e-01 | 0.562726 | 4.675198e-16 | -0.123684 | -0.081634 | -0.120414 | 1.958716 | -0.557098 | 0.459359 | -0.444582 | -0.707349 | 3.727262 | -1.656478 | 1.572260 | 1 |
| 149824 | 8.095035e-01 | 0.046557 | 4.675198e-16 | 0.911498 | -0.081634 | -0.120414 | 1.348879 | 1.795017 | 0.459359 | -0.476858 | -0.707349 | 3.727262 | -1.656478 | 1.572260 | 1 |
149825 rows × 15 columns
train_f_transformed_df['TARGET'].isnull().sum()
0
train_f_transformed_df['TARGET'].value_counts()
0    125000
1     24825
Name: TARGET, dtype: int64
train_f_transformed_df.to_csv("train_nn.csv")
X = train_f_transformed_df.drop(['TARGET'], axis = 1).values
y = train_f_transformed_df["TARGET"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
len(X_test)
29965
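# Keep features and labels on the same device as the models. The models are never moved
# with .to(device), so everything stays on the CPU here.
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)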
class MLP(nn.Module):
    """Two hidden layers with leaky-ReLU activations; returns raw logits for the two classes."""
    def __init__(self, input_features=14, hidden1=20, hidden2=20, out_features=2):
        super().__init__()
        self.f_connected1 = nn.Linear(input_features, hidden1)
        self.f_connected2 = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, out_features)

    def forward(self, x):
        x = F.leaky_relu(self.f_connected1(x))
        x = F.leaky_relu(self.f_connected2(x))
        x = self.out(x)
        return x

class SLP(nn.Module):
    """Same idea with a single hidden layer."""
    def __init__(self, input_features=14, hidden1=20, out_features=2):
        super().__init__()
        self.f_connected1 = nn.Linear(input_features, hidden1)
        self.out = nn.Linear(hidden1, out_features)

    def forward(self, x):
        x = F.leaky_relu(self.f_connected1(x))
        x = self.out(x)
        return x
torch.manual_seed(20)
model= MLP()
torch.manual_seed(20)
model2= SLP()
model.parameters
<bound method Module.parameters of MLP(
  (f_connected1): Linear(in_features=14, out_features=20, bias=True)
  (f_connected2): Linear(in_features=20, out_features=20, bias=True)
  (out): Linear(in_features=20, out_features=2, bias=True)
)>
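Accessing `model.parameters` without parentheses only displays the bound method together with the layer layout. The number of trainable parameters follows directly from the layer sizes shown: a fully connected layer with `in` inputs and `out` outputs holds `in * out` weights plus `out` biases. A small worked example in plain Python (the helper `linear_param_count` is made up for illustration):

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Layer shapes taken from the MLP and SLP definitions in this notebook.
mlp_layers = [(14, 20), (20, 20), (20, 2)]
slp_layers = [(14, 20), (20, 2)]

mlp_total = sum(linear_param_count(i, o) for i, o in mlp_layers)
slp_total = sum(linear_param_count(i, o) for i, o in slp_layers)
print(mlp_total, slp_total)  # 762 342
```

On the real model the same figure should come out of `sum(p.numel() for p in model.parameters())`.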
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=0.01)
size_ = y_train.shape[0]
epochs = 1000
final_losses = []
for i in range(1, epochs + 1):
    # Full-batch forward pass; the model returns raw logits.
    y_pred = model(X_train)
    loss = loss_function(y_pred, y_train)
    final_losses.append(loss.item())
    # Training accuracy: the predicted class is the index of the larger logit.
    _, predicted = torch.max(y_pred, 1)
    acc = (predicted == y_train).sum().item()
    if i % 100 == 1:
        print("Epoch Number: {} and the loss: {} accuracy {}".format(i, loss.item(), acc/size_))
    writer.add_scalar('Training loss', loss.item(), i)
    writer.add_scalar('Accuracy', acc/size_, i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Epoch Number: 1 and the loss: 0.39586523175239563 accuracy 0.8388536626063741
Epoch Number: 101 and the loss: 0.39574986696243286 accuracy 0.83882863340564
Epoch Number: 201 and the loss: 0.39562028646469116 accuracy 0.8390038378107793
Epoch Number: 301 and the loss: 0.39556336402893066 accuracy 0.8388870348740197
Epoch Number: 401 and the loss: 0.39552903175354004 accuracy 0.8389537794093108
Epoch Number: 501 and the loss: 0.3954128921031952 accuracy 0.838978808610045
Epoch Number: 601 and the loss: 0.39543044567108154 accuracy 0.8390705823460705
Epoch Number: 701 and the loss: 0.39534348249435425 accuracy 0.839020523944602
Epoch Number: 801 and the loss: 0.39528822898864746 accuracy 0.8388870348740197
Epoch Number: 901 and the loss: 0.3952784836292267 accuracy 0.838895377940931
# model2 needs its own optimizer: `optimizer` only holds the parameters of `model`, so
# stepping it leaves model2 untrained (the constant loss in the logged run below came
# from stepping that shared optimizer).
optimizer2 = torch.optim.Adam(model2.parameters(), lr=0.01)
size_ = y_train.shape[0]
epochs = 1000
final_losses = []
for i in range(1, epochs + 1):
    y_pred = model2(X_train)
    loss = loss_function(y_pred, y_train)
    final_losses.append(loss.item())
    _, predicted = torch.max(y_pred, 1)
    acc = (predicted == y_train).sum().item()
    if i % 100 == 1:
        print("Epoch Number: {} and the loss: {} accuracy {}".format(i, loss.item(), acc/size_))
    writer.add_scalar('Training loss', loss.item(), i)
    writer.add_scalar('Accuracy', acc/size_, i)
    optimizer2.zero_grad()
    loss.backward()
    optimizer2.step()
Epoch Number: 1 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 101 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 201 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 301 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 401 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 501 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 601 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 701 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 801 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
Epoch Number: 901 and the loss: 0.5945405960083008 accuracy 0.7735357917570499
writer.add_graph(model, X_train)
writer.close()
predictions = []
with torch.no_grad():
    for i,data in enumerate(X_test):
        y_pred = model(data)
        predictions.append(y_pred.argmax().item())
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,predictions)
cm
array([[24676, 300],
[ 4620, 369]])
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,predictions)
score
0.8358084431837143
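Accuracy alone can be misleading on an imbalanced test set like this one. As a small sketch in plain Python (the helper name `summarize_confusion` is our own and not part of scikit-learn), the four counts of the confusion matrix above can be turned into the majority-class baseline and the class-1 precision and recall. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive summary metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy obtained by always predicting class 0
        "majority_baseline": (tn + fp) / total,
        # Share of true class-1 samples that the model finds
        "recall_class_1": tp / (tp + fn),
        # Share of class-1 predictions that are correct
        "precision_class_1": tp / (tp + fp),
    }


# Counts copied from the confusion matrix printed above
metrics = summarize_confusion(tn=24676, fp=300, fn=4620, tp=369)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is only slightly above the baseline of roughly 83.4%, and about 7.4% of the class-1 samples are recovered.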
predictions2 = []
with torch.no_grad():
    for i,data in enumerate(X_test):
        y_pred = model2(data)
        predictions2.append(y_pred.argmax().item())
from sklearn.metrics import confusion_matrix
# Note: this scores `predictions` (collected from `model`), not `predictions2`
cm = confusion_matrix(y_test,predictions)
cm
array([[24600, 376],
[ 4559, 430]])
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,predictions)
score
0.8353078591690305
len(probabilities)
49965
len(X_test)
49965
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
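As an illustration of this file format, here is a minimal sketch that writes the header and one row per applicant using only the standard library. The function name `write_submission` and the range check are our own additions; the notebook itself writes the file with pandas `to_csv`.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (SK_ID_CURR, probability) pairs in the required CSV layout."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SK_ID_CURR", "TARGET"])
    for sk_id, probability in rows:
        # TARGET must be a probability, not a hard 0/1 label
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability out of range for {sk_id}: {probability}")
        writer.writerow([sk_id, probability])


# Example rows taken from the format description above
buffer = io.StringIO()
write_submission([(100001, 0.1), (100005, 0.9), (100013, 0.2)], buffer)
print(buffer.getvalue())
```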
X_kaggle_test = datasets["application_test"]
X_kaggle_test
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 48740 | 456222 | Cash loans | F | N | N | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48741 | 456223 | Cash loans | F | Y | Y | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 |
| 48742 | 456224 | Cash loans | M | N | N | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 48743 | 456250 | Cash loans | F | Y | N | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
48744 rows × 116 columns
#X_kaggle_test.drop(['DAYS_LAST_PHONE_CHANGE','OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'], axis = 1, inplace = True)
X_kaggle_test = X_kaggle_test.loc[:, ~X_kaggle_test.columns.str.startswith("FLAG_DOCUMENT_")]
X_kaggle_test = X_kaggle_test.loc[:, ~X_kaggle_test.columns.str.endswith("MODE")]
X_kaggle_test = X_kaggle_test.loc[:, ~X_kaggle_test.columns.str.endswith("MEDI")]
X_kaggle_test = X_kaggle_test.loc[:, ~X_kaggle_test.columns.str.endswith("AVG")]
X_kaggle_test.shape
(48744, 49)
X_kaggle_test['ACTIVE_LOANS_PERCENTAGE'] = train_features_fe['ACTIVE_LOANS_PERCENTAGE']
X_kaggle_test['CREDIT_INCOME_RATIO'] = train_features_fe['CREDIT_INCOME_RATIO']
X_kaggle_test['YEARS_TO_PAY'] = train_features_fe['YEARS_TO_PAY']
X_kaggle_test['INCOME_ANNUITY'] = train_features_fe['INCOME_ANNUITY']
X_kaggle_test.shape
(48744, 53)
X_kaggle_test
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.473684 | 0.179083 | 34.0 | 6.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.500000 | 0.974062 | 10.0 | 10.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.181818 | 0.272727 | 20.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.818182 | 0.839087 | 15.0 | 12.0 |
| 48740 | 456222 | Cash loans | F | N | N | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.500000 | 0.588928 | 10.0 | 6.0 |
| 48741 | 456223 | Cash loans | F | Y | Y | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.375000 | 0.263190 | 20.0 | 5.0 |
| 48742 | 456224 | Cash loans | M | N | N | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.333333 | 0.300000 | 19.0 | 6.0 |
| 48743 | 456250 | Cash loans | F | Y | N | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.181818 | 0.495050 | 10.0 | 5.0 |
48744 rows × 53 columns
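The engineered values shown for test rows 0 to 4 are the same as those in rows 0 to 4 of the training table `train_f1` printed elsewhere in this notebook. That is what one would expect if pandas aligned the column assignment on the row index rather than on the applicant. A safer pattern is to look features up by `SK_ID_CURR`. The sketch below shows the idea in plain Python; `attach_features`, `features_by_id` and the sample values are hypothetical. In pandas the equivalent would be a left merge on `SK_ID_CURR` against features computed for the test applicants themselves.

```python
def attach_features(applications, features_by_id, feature_names, default=None):
    """Return copies of the application rows with features joined by SK_ID_CURR."""
    enriched = []
    for row in applications:
        # Look up by applicant ID, never by row position
        per_client = features_by_id.get(row["SK_ID_CURR"], {})
        new_row = dict(row)
        for name in feature_names:
            new_row[name] = per_client.get(name, default)
        enriched.append(new_row)
    return enriched


# Hypothetical values, for illustration only
test_rows = [{"SK_ID_CURR": 100001}, {"SK_ID_CURR": 100005}]
features_by_id = {100005: {"YEARS_TO_PAY": 12.0}}
result = attach_features(test_rows, features_by_id, ["YEARS_TO_PAY"])
print(result)
```

Applicants without a matching feature record get the default value (here `None`), which the imputation step can then fill.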
X_kaggle_test.drop(['SK_ID_CURR'], axis = 1, inplace = True)
X_kaggle_test.drop(['AVG_DPD'], axis = 1, inplace = True)
test_class_scores = after_balancing_model.predict_proba(X_kaggle_test)[:, 1]
# Submission dataframe
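# Copy the ID column so that adding TARGET does not write into a slice of the original frame
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()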
submit_df['TARGET'] = test_class_scores
submit_df.head()
| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.210149 |
| 1 | 100005 | 0.443293 |
| 2 | 100013 | 0.200282 |
| 3 | 100028 | 0.138116 |
| 4 | 100038 | 0.358870 |
X_kaggle_test = datasets["application_test"]
train_f1
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.713631 | 0.510759 | 0.858157 | -22456 | 2 | 3 | -3841 | 0 | 0 | 365243 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 1 | 0 | 0.396220 | 0.652577 | 0.764642 | -19506 | 2 | 2 | -2423 | 0 | 0 | 365243 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 2 | 0 | 0.689479 | 0.644868 | NaN | -10245 | 1 | 1 | -1881 | 0 | 1 | -2418 | 0.473684 | 0.179083 | 34.0 | 6.0 |
| 3 | 0 | NaN | 0.218719 | 0.294392 | -9349 | 3 | 3 | -22 | 0 | 1 | -1282 | 0.500000 | 0.974062 | 10.0 | 10.0 |
| 4 | 0 | 0.318596 | 0.433441 | NaN | -13464 | 3 | 3 | -3765 | 0 | 1 | -1405 | 0.181818 | 0.272727 | 20.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149820 | 1 | 0.360613 | 0.329708 | 0.073452 | -9918 | 3 | 3 | -2580 | 0 | 1 | -3048 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149821 | 1 | 0.424130 | 0.583214 | 0.634729 | -13416 | 2 | 2 | -4704 | 0 | 1 | -2405 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149822 | 1 | 0.511892 | 0.713524 | NaN | -20644 | 2 | 2 | -3832 | 0 | 1 | -3147 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149823 | 1 | 0.397946 | 0.615261 | NaN | -16471 | 2 | 2 | -9 | 0 | 1 | -286 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149824 | 1 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 | 0.214286 | 1.472320 | 9.0 | 14.0 |
149825 rows × 15 columns
# Copy the selected columns so that the assignments below do not write into a slice of X_kaggle_test
X_kaggle_test_f = X_kaggle_test[['EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','DAYS_BIRTH','REGION_RATING_CLIENT_W_CITY','REGION_RATING_CLIENT','DAYS_ID_PUBLISH','REG_CITY_NOT_WORK_CITY','FLAG_EMP_PHONE','DAYS_EMPLOYED']].copy()
X_kaggle_test_f['ACTIVE_LOANS_PERCENTAGE'] = train_f1['ACTIVE_LOANS_PERCENTAGE']
X_kaggle_test_f['CREDIT_INCOME_RATIO'] = train_f1['CREDIT_INCOME_RATIO']
X_kaggle_test_f['YEARS_TO_PAY'] = train_f1['YEARS_TO_PAY']
X_kaggle_test_f['INCOME_ANNUITY'] = train_f1['INCOME_ANNUITY']
X_kaggle_test_f.shape
(48744, 14)
num_pipeline = Pipeline([
('scaler', StandardScaler()),
('imputer', SimpleImputer(strategy = 'mean'))
])
X_kaggle_test_f_transformed = num_pipeline.fit_transform(X_kaggle_test_f)
column_names = list(numerical_features)
X_kaggle_test_transformed_df = pd.DataFrame(X_kaggle_test_transformed, columns=column_names)
X_kaggle_test_nn = torch.FloatTensor(X_kaggle_test_f_transformed)
predictions = []
probabilities = []
with torch.no_grad():
    for i,data in enumerate(X_kaggle_test_nn):
        y_pred = model(data)
        # y_pred holds the two class logits of one sample, so softmax runs over dim 0
        probabilities.append(F.softmax(y_pred, dim=0)[1].item())
        predictions.append(y_pred.argmax().item())
        # print(y_pred.argmax().item())
len(probabilities)
48744
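The submitted probability is the softmax of the network's two output logits, read at index 1 (the positive class). As a sketch of what that computation does for a single applicant, in plain Python and with made-up logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for one applicant: [class 0, class 1]
logits = [1.2, -0.9]
probs = softmax(logits)
positive_probability = probs[1]
print(positive_probability)
```

For two classes this reduces to a sigmoid of the logit difference, so the predicted class from `argmax` is 1 exactly when this probability exceeds 0.5.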
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()
submit_df['TARGET'] = probabilities
submit_df.head()
| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.106409 |
| 1 | 100005 | 0.165703 |
| 2 | 100013 | 0.053740 |
| 3 | 100028 | 0.104068 |
| 4 | 100038 | 0.211880 |
submit_df.to_csv("submission.csv",index=False)
X_kaggle_test = X_kaggle_test.loc[:, ~X_kaggle_test.columns.str.startswith("FLAG_DOCUMENT_")]
X_kaggle_test.shape
(48744, 96)
X_kaggle_test['ACTIVE_LOANS_PERCENTAGE'] = train['ACTIVE_LOANS_PERCENTAGE']
X_kaggle_test['CREDIT_INCOME_RATIO'] = train['CREDIT_INCOME_RATIO']
X_kaggle_test['YEARS_TO_PAY'] = train['YEARS_TO_PAY']
X_kaggle_test['INCOME_ANNUITY'] = train['INCOME_ANNUITY']
X_kaggle_test['AVG_DPD'] = train['AVG_DPD']
X_kaggle_test
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY | AVG_DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | NaN | NaN | NaN | NaN | NaN | 0.250000 | 0.498036 | 16.0 | 8.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.578947 | 0.834725 | 18.0 | 15.0 | 0.049383 |
| 48740 | 456222 | Cash loans | F | N | N | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | NaN | NaN | NaN | NaN | NaN | 0.578947 | 0.834725 | 18.0 | 15.0 | 0.049383 |
| 48741 | 456223 | Cash loans | F | Y | Y | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.578947 | 0.834725 | 18.0 | 15.0 | 0.049383 |
| 48742 | 456224 | Cash loans | M | N | N | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.578947 | 0.834725 | 18.0 | 15.0 | 0.049383 |
| 48743 | 456250 | Cash loans | F | Y | N | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.578947 | 0.834725 | 18.0 | 15.0 | 0.049383 |
48744 rows × 101 columns
numerical_features = X_kaggle_test.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = X_kaggle_test.select_dtypes(include = ['object', 'bool']).columns
num_pipeline = Pipeline([
('scaler', StandardScaler()),
('imputer', SimpleImputer(strategy = 'median'))
])
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_pipeline = ColumnTransformer([
("num_pipeline", num_pipeline, numerical_features),
("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)
X_kaggle_test_transformed = data_pipeline.fit_transform(X_kaggle_test)
column_names = list(numerical_features) + \
list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(categorical_features))
X_kaggle_test_transformed_df = pd.DataFrame(X_kaggle_test_transformed, columns=column_names)
X_kaggle_test_transformed_df.shape
(48744, 225)
X_train_transformed_df.shape
(1088000, 225)
set(X_train_transformed_df.columns).difference(set(X_kaggle_test_transformed_df.columns))
{'CODE_GENDER_XNA',
'NAME_FAMILY_STATUS_Unknown',
'NAME_INCOME_TYPE_Maternity leave'}
# These categories never occur in the test set, so their one-hot columns are all zeros
X_kaggle_test_transformed_df['CODE_GENDER_XNA'] = 0
X_kaggle_test_transformed_df['NAME_FAMILY_STATUS_Unknown'] = 0
X_kaggle_test_transformed_df['NAME_INCOME_TYPE_Maternity leave'] = 0
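A model trained on a feature matrix expects the same columns in the same order at prediction time, and a category that never occurs in the test set should appear as an all-zero indicator column. The sketch below shows that alignment in plain Python; the function name and the tiny example are hypothetical. With pandas, `DataFrame.reindex(columns=..., fill_value=0)` should achieve the same effect, assuming the training frame holds only feature columns.

```python
def align_to_training_columns(test_rows, train_columns, fill_value=0.0):
    """Order each test row like the training columns.

    Columns missing from a test row are filled with fill_value, and
    columns that the training data never had are dropped.
    """
    return [[row.get(column, fill_value) for column in train_columns]
            for row in test_rows]


# Hypothetical miniature example
train_columns = ["CODE_GENDER_F", "CODE_GENDER_M", "CODE_GENDER_XNA"]
test_rows = [
    {"CODE_GENDER_M": 1.0, "CODE_GENDER_F": 0.0, "EXTRA_COLUMN": 7.0},
    {"CODE_GENDER_M": 0.0, "CODE_GENDER_F": 1.0},
]
aligned = align_to_training_columns(test_rows, train_columns)
print(aligned)
```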
X_kaggle_test_transformed_4 = X_kaggle_test_transformed_df.to_numpy()
test_class_scores = grid_search_model1.predict_proba(X_kaggle_test_transformed_4)[:, 1]
test_class_scores[0:10]
array([0.01205682, 0.23306839, 0.11699465, 0.10186553, 0.05286666,
0.20762748, 0.05490155, 0.07793292, 0.00758586, 0.12317721])
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()
submit_df['TARGET'] = test_class_scores
submit_df.head()
| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.012057 |
| 1 | 100005 | 0.233068 |
| 2 | 100013 | 0.116995 |
| 3 | 100028 | 0.101866 |
| 4 | 100038 | 0.052867 |
submit_df.to_csv("submission.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "Neural Network Submission"
100%|██████████████████████████████████████| 1.25M/1.25M [00:01<00:00, 1.25MB/s] Successfully submitted to Home Credit Default Risk
Figure: Kaggle submission result for Phase 3 (kagglePhase3.png)
The main goal of this project is to build an ML model that predicts whether a loan applicant will be able to repay their loan. In Phase 2, we extended our Phase 1 work with feature engineering: we considered potential features from the other tables, performed feature selection on the derived features, analyzed feature importances, and tuned hyperparameters. We derived 6 additional features and improved on our baseline model; however, some of the derived features led to over-fitting, so as future work we aimed to select the most important features and balance the data to avoid it. The feature most relevant to our goal was the Credit Annuity ratio of the current application, which we identified from the feature importances of our model. After hyperparameter tuning on logistic regression (our best pipeline), we achieved a test accuracy of 92.46%. In Phase 3, we extended our Phase 2 work with a deep learning model, a Multi-Layer Perceptron (MLP), which is a kind of artificial neural network. The main goal of this phase was to implement the MLP and visualize its training on TensorBoard. We also identified data leakage in our project and the reasons for it. Another goal of this phase was to improve on our Phase 2 results, and we succeeded in doing so.
Data Description:
application_{train|test}.csv: The main table, divided into two files: Train (with TARGET) and Test (without TARGET).
bureau.csv: All past credit issued to the client by other financial institutions and reported to the Credit Bureau.
bureau_balance.csv: Monthly balances of previous credits in Credit Bureau.
POS_CASH_balance.csv: Monthly balance snapshots of the applicant's prior POS (point of sale) and cash loans with Home Credit.
credit_card_balance.csv: Monthly balance snapshots of the applicant's prior credit cards with Home Credit.
previous_application.csv: All prior Home Credit loan applications of clients with loans in our sample.
installments_payments.csv: Payment history in Home Credit for previously disbursed credits related to the loans in our sample.
HomeCredit_columns_description.csv: Describes the columns in the various data files.
The tasks to be tackled are:
In Phase 2, we implemented Feature engineering to identify potential features that could help us get better results. We mainly derived 6 features:
num_pipeline = Pipeline([
('scaler', StandardScaler()),
('imputer', SimpleImputer(strategy = 'median'))
])
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_pipeline = ColumnTransformer([
("num_pipeline", num_pipeline, numerical_features),
("cat_pipeline", cat_pipeline, categorical_features)], remainder = 'drop', n_jobs = -1)
Here we created two separate pipelines, one for numerical and one for categorical features. The numerical pipeline standardizes the features and imputes missing values with the median; the categorical pipeline imputes missing values with the most frequent category and then applies one-hot encoding. We combined the two pipelines with a ColumnTransformer and passed the result on to modeling.
clf_pipe = make_pipeline(data_pipeline, LogisticRegression())
In this pipeline, the combined data pipeline feeds a Logistic Regression model.
RF = RandomForestClassifier(random_state = 42,n_estimators=20, criterion='gini', max_depth=6)
data_pipeline_rf = make_pipeline(data_pipeline, RF)
Here the same data pipeline feeds the Random Forest classifier.
We derived 6 features in total out of which we selected 5 for our modeling.
Impact of these features to the model -
Cross Entropy Loss criterion computes the cross entropy loss between input and target.
It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
In our project, we do have an unbalanced training set: about 270000 of the roughly 300000 data points have a TARGET of 0 (i.e. clients without payment difficulties).
Equation -
Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:
$$H(P, Q) = -\sum_{x \in X} P(x) \log Q(x)$$
where P(x) is the probability of event x under P, Q(x) is the probability of event x under Q, and log is the base-2 logarithm when the result is measured in bits. PyTorch's CrossEntropyLoss uses the natural logarithm, so the loss values reported in this notebook are in nats.
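To make the role of the optional class weights concrete, here is a small worked sketch in plain Python. It follows the weighted-mean form that PyTorch documents for `CrossEntropyLoss` with `reduction='mean'` (each sample's negative log-likelihood is scaled by the weight of its true class, and the sum is divided by the sum of those weights). The function name, the logits and the weights `[1.0, 9.0]` are illustrative assumptions, not values used in this project.

```python
import math


def cross_entropy(logits_batch, targets, class_weights=None):
    """Weighted mean cross-entropy (natural log) over a batch of logits."""
    n_classes = len(logits_batch[0])
    if class_weights is None:
        class_weights = [1.0] * n_classes
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(logits_batch, targets):
        # log-sum-exp with a shift for numerical stability
        shift = max(logits)
        log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits))
        nll = log_norm - logits[target]  # equals -log softmax(logits)[target]
        weighted_sum += class_weights[target] * nll
        weight_total += class_weights[target]
    return weighted_sum / weight_total


# Two samples with identical logits that favour class 0
batch = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
plain = cross_entropy(batch, labels)
weighted = cross_entropy(batch, labels, class_weights=[1.0, 9.0])
print(plain, weighted)
```

Up-weighting the rare class makes the misclassified class-1 sample dominate the loss, which pushes training away from always predicting the majority class.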
class SLP(nn.Module):
    def __init__(self,input_features=14,hidden1=20,out_features=2):
        super().__init__()
        # One hidden layer followed by the output layer
        self.f_connected1 = nn.Linear(input_features,hidden1)
        self.out = nn.Linear(hidden1,out_features)
    def forward(self,x):
        x = F.leaky_relu(self.f_connected1(x))
        # Raw logits; CrossEntropyLoss applies log-softmax internally
        x = self.out(x)
        return x
This network takes 14 input features, has one hidden linear layer with 20 neurons, and produces 2 outputs (one logit per class).
In the forward pass, leaky_relu is the activation function of the hidden layer.
class MLP(nn.Module):
    def __init__(self,input_features=14,hidden1=20,hidden2=20,out_features=2):
        super().__init__()
        # Two hidden layers followed by the output layer
        self.f_connected1 = nn.Linear(input_features,hidden1)
        self.f_connected2 = nn.Linear(hidden1,hidden2)
        self.out = nn.Linear(hidden2,out_features)
    def forward(self,x):
        x = F.leaky_relu(self.f_connected1(x))
        x = F.leaky_relu(self.f_connected2(x))
        # Raw logits; CrossEntropyLoss applies log-softmax internally
        x = self.out(x)
        return x
This network takes 14 input features, has two hidden linear layers with 20 neurons each, and produces 2 outputs (one logit per class).
In the forward pass, leaky_relu is the activation function of both hidden layers.
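To give a sense of the size of the two architectures, the sketch below counts the trainable parameters of a stack of fully connected layers (weights plus biases) and shows the leaky ReLU activation in scalar form. The helper names are our own. The slope of 0.01 is the default of `F.leaky_relu`, which we assume was left unchanged.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def leaky_relu(x, negative_slope=0.01):
    """Identity for positive inputs, small linear slope for negative ones."""
    return x if x > 0 else negative_slope * x


slp_params = count_dense_parameters([14, 20, 2])      # one hidden layer
mlp_params = count_dense_parameters([14, 20, 20, 2])  # two hidden layers
print(slp_params, mlp_params)
print(leaky_relu(3.0), leaky_relu(-3.0))
```

The second hidden layer accounts for 20 * 20 + 20 = 420 of the additional parameters.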
size_ = y_train.shape[0]
acc = 0
epochs = 1000
final_losses = []
for i in range(epochs):
    i += 1
    # Full-batch forward pass and loss
    y_pred = model.forward(X_train)
    loss = loss_function(y_pred,y_train)
    final_losses.append(loss.item())
    # Number of correctly classified training samples
    _, predicted = torch.max(y_pred, 1)
    acc = (predicted == y_train).sum().item()
    if i%100 == 1:
        print("Epoch Number: {} and the loss: {} accuracy {}".format(i,loss.item(), acc/size_))
    # Log scalars for TensorBoard
    writer.add_scalar('Training loss', loss.item(), i)
    writer.add_scalar('Accuracy', acc/size_, i)
    acc = 0
    # Backpropagation and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Epoch Number: 1 and the loss: 0.39511343836784363 accuracy 0.8388119472718171
Epoch Number: 101 and the loss: 0.395035058259964 accuracy 0.8389537794093108
Epoch Number: 201 and the loss: 0.3950144350528717 accuracy 0.838895377940931
Epoch Number: 301 and the loss: 0.3949779272079468 accuracy 0.8390372100784248
Epoch Number: 401 and the loss: 0.39475810527801514 accuracy 0.8390372100784248
Epoch Number: 501 and the loss: 0.3946821391582489 accuracy 0.839020523944602
Epoch Number: 601 and the loss: 0.39463287591934204 accuracy 0.8392040714166528
Epoch Number: 701 and the loss: 0.39447444677352905 accuracy 0.8392875020857667
Epoch Number: 801 and the loss: 0.3945179283618927 accuracy 0.8391706991490072
Epoch Number: 901 and the loss: 0.3944171071052551 accuracy 0.8391790422159185
predictions = []
with torch.no_grad():
    for i,data in enumerate(X_test):
        y_pred = model(data)
        predictions.append(y_pred.argmax().item())
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,predictions)
cm
array([[24600, 376],
[ 4559, 430]])
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,predictions)
score
0.8353078591690305
We achieved a test accuracy of 83.53% with the neural network model.
Figure: TensorBoard plots of training loss and accuracy (tensorboard.png)
There has been some data leakage in our implementation. I'll state these pointwise -
Families of Input Features -
numerical_features = train.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = train.select_dtypes(include = ['object', 'bool']).columns
print("\n Numerical Features - {}".format(numerical_features))
print("\n Categorical Features - {}".format(categorical_features))
Numerical Features - Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL',
'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1',
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')
Categorical Features - Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'],
dtype='object')
print("Number of Input Features to the Neural network- ")
train_f1
Number of Input Features to the Neural network-
| | TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | DAYS_ID_PUBLISH | REG_CITY_NOT_WORK_CITY | FLAG_EMP_PHONE | DAYS_EMPLOYED | ACTIVE_LOANS_PERCENTAGE | CREDIT_INCOME_RATIO | YEARS_TO_PAY | INCOME_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.713631 | 0.510759 | 0.858157 | -22456 | 2 | 3 | -3841 | 0 | 0 | 365243 | 0.181818 | 0.495050 | 10.0 | 5.0 |
| 1 | 0 | 0.396220 | 0.652577 | 0.764642 | -19506 | 2 | 2 | -2423 | 0 | 0 | 365243 | 0.294118 | 0.400000 | 21.0 | 9.0 |
| 2 | 0 | 0.689479 | 0.644868 | NaN | -10245 | 1 | 1 | -1881 | 0 | 1 | -2418 | 0.473684 | 0.179083 | 34.0 | 6.0 |
| 3 | 0 | NaN | 0.218719 | 0.294392 | -9349 | 3 | 3 | -22 | 0 | 1 | -1282 | 0.500000 | 0.974062 | 10.0 | 10.0 |
| 4 | 0 | 0.318596 | 0.433441 | NaN | -13464 | 3 | 3 | -3765 | 0 | 1 | -1405 | 0.181818 | 0.272727 | 20.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149820 | 1 | 0.360613 | 0.329708 | 0.073452 | -9918 | 3 | 3 | -2580 | 0 | 1 | -3048 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149821 | 1 | 0.424130 | 0.583214 | 0.634729 | -13416 | 2 | 2 | -4704 | 0 | 1 | -2405 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149822 | 1 | 0.511892 | 0.713524 | NaN | -20644 | 2 | 2 | -3832 | 0 | 1 | -3147 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149823 | 1 | 0.397946 | 0.615261 | NaN | -16471 | 2 | 2 | -9 | 0 | 1 | -286 | 0.214286 | 1.472320 | 9.0 | 14.0 |
| 149824 | 1 | 0.661024 | 0.514163 | NaN | -11961 | 2 | 2 | -931 | 1 | 1 | -4786 | 0.214286 | 1.472320 | 9.0 | 14.0 |
149825 rows × 15 columns
print("Count of Input Features - ", len(train_f1.columns) - 1)
Count of Input Features - 14
Hyperparameters considered for Logistic Regression -
param_grid_lr = {
'C': [10**x for x in range(-3,4)],
'penalty': ['l1','l2']
}
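As a quick check of how large this search is, the grid can be expanded into its individual candidates with the standard library. This is only a sketch of what a grid search enumerates; the notebook itself relies on scikit-learn for the actual search.

```python
from itertools import product

param_grid_lr = {
    'C': [10**x for x in range(-3, 4)],
    'penalty': ['l1', 'l2'],
}

# Cartesian product of all hyperparameter values
names = list(param_grid_lr)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid_lr[name] for name in names))]

print(len(candidates))
print(candidates[0], candidates[-1])
```

Each of the 14 candidates is fitted once per cross-validation fold. Note that in scikit-learn the 'l1' penalty is only available with solvers that support it (for example liblinear or saga), so whether the 'l1' half of the grid actually trains depends on the solver setting, which is not shown here.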
Loss function used - Cross Entropy Loss
Cross Entropy Loss criterion computes the cross entropy loss between input and target.
It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
In our project, we do have an unbalanced training set: about 270000 of the roughly 300000 data points have a TARGET of 0 (i.e. clients without payment difficulties).
Equation -
Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:
$$H(P, Q) = -\sum_{x \in X} P(x) \log Q(x)$$
where P(x) is the probability of event x under P, Q(x) is the probability of event x under Q, and log is the base-2 logarithm when the result is measured in bits. PyTorch's CrossEntropyLoss uses the natural logarithm, so the loss values reported in this notebook are in nats.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,predictions)
cm
array([[24600, 376],
[ 4559, 430]])
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,predictions)
score
0.8353078591690305
In Phase 2, we mainly considered Logistic Regression and the Random Forest Classifier as our candidate models, based on the previous phase's results.
Additionally, as feature engineering was part of this phase, we derived 6 new candidate features from the given datasets, of which we selected 5 for model training.
Including these features increased the test accuracy of Logistic Regression from 91.93% to 92.08%, as can be seen from the results.
This accuracy was measured before hyperparameter tuning. We then tuned the hyperparameters to improve the model further.
Hyperparameter tuning improved the test accuracy of Logistic Regression from 92.08% to 92.46%.
Important Callout: The one feature that we did not select for model training was Debt Overdue per Customer. Selecting this feature caused a sudden spike in our test accuracy from 92.08% to 97.00%. We believe that this spike is due to overfitting, and thus we keep the last test accuracy, i.e. 92.13%, as our result from feature engineering.
Similarly, for the Random Forest Classifier we achieved a test accuracy of 92.975%.
Phase 3 -
In Phase 3, we first improved our results from Phase 2: our Kaggle submission now scores 0.73136, up from our previous score of 0.6491.
We implemented deep learning in this phase, using a Multi-Layer Perceptron as our model.
We achieved a train accuracy of 83.91% after 1000 epochs and a test accuracy of 83.53%.
We then visualized the training of the model on TensorBoard and generated plots of the loss function and the accuracy.
We also implemented a Single Layer Perceptron model, which achieved slightly lower accuracy than the multi-layer model.
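The Kaggle scores quoted above are not accuracies: to our knowledge the competition ranks submissions by the area under the ROC curve (AUC) of the predicted probabilities. The sketch below computes AUC directly from its pairwise definition, in plain Python and on made-up labels and scores. In practice scikit-learn's `roc_auc_score` would be the usual choice, and this quadratic-time version is only meant to show what the number measures.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Because AUC depends only on how the applicants are ranked, a model can have high accuracy on imbalanced data and still a modest AUC, which is why the two figures in this report are not directly comparable.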
The main purpose of this project is to create a Machine Learning model that can predict whether or not a loan applicant will be able to repay the loan. Many worthy applicants with no credit history or default history are evaluated without any statistical analysis. The ML model in our work is trained using the HCDR dataset. It predicts whether an applicant will be able to repay their loan based on the history of similar applicants in the past. This helps in filtering applicants with good statistical backing derived from the various factors taken into consideration, which benefits both a worthy applicant in securing a loan and the bank in growing its business. We performed feature engineering, feature selection and hyperparameter tuning to improve our classification model's ability to predict whether a loan applicant is able to repay their loan. We identified the Credit Annuity Ratio of the application as the most important feature in our implementation. This feature is the ratio of the credited loan amount to the loan annuity of the applicant. The results we obtained after modeling and fine-tuning give us confidence that the model can predict applicants' creditworthiness. There might be some inaccurate feature selections in our work, as we got a decreased score on our Kaggle submission. We will analyze which features are causing this and add or remove features to improve our score.
This is the third iteration of our model; we improved the inaccurate feature selections, which also improved the Kaggle submission. We also implemented a deep learning algorithm, the Multi-Layer Perceptron, and generated a classification model with it. We used TensorBoard to visualize the training of our model in real time. We found that improving feature selection and balancing the data gives better results. We also found that the Multi-Layer Perceptron model performs better than the Single Layer Perceptron model.
Figure: Kaggle submission result for Phase 3 (kagglePhase3.png)
Read the following: